2013-02-20 220 views
0

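Batch rename (the basename of) files/folders using an "index"

Batch renaming of files and folders is a frequently asked question, but after some searching I don't think any existing question is quite like mine.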

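Background: we send biological samples to a service provider, which returns files with unique names plus a table in text format that lists, among other information, each file name and the sample it originated from: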

head samples.txt 
fq_file Sample_ID Sample_name Library_ID FC_Number Track_Lanes_Pos 
L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6 
L2349_Track-3865_R1.fastq.gz S1726_A_3_t A 3 t L2349_A_3_t 163 5 
L2354_Track-3870_R1.fastq.gz S1731_A_GFP_c A GFP c L2354_A_GFP_c 163 5 
L2377_Track-3893_R1.fastq.gz S1754_B_7_c B 7 c L2377_B_7_c 163 7 
L2362_Track-3878_R1.fastq.gz S1739_B_GFP_t B GFP t L2362_B_GFP_t 163 6 

Directory structure (34 directories):

L2369_Track-3885_ 
    accepted_hits.bam  
    deletions.bed 
    junctions.bed   
    logs 
    accepted_hits.bam.bai 
    insertions.bed 
    left_kept_reads.info 
L2349_Track-3865_ 
    accepted_hits.bam  
    deletions.bed 
    junctions.bed   
    logs 
    accepted_hits.bam.bai 
    insertions.bed 
    left_kept_reads.info 

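Goal: because the file names are meaningless and hard to interpret, I want to rename the files ending in .bam (keeping the suffix) and the folders to the corresponding sample name, re-ordered in a more suitable way. The result should look like this: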

7_t_B 
    7_t_B.bam
    deletions.bed 
    junctions.bed   
    logs 
    7_t_B.bam.bai 
    insertions.bed 
    left_kept_reads.info 
3_t_A 
    3_t_A.bam  
    deletions.bed 
    junctions.bed   
    logs 
    accepted_hits.bam.bai 
    insertions.bed 
    left_kept_reads.info 

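I hacked together a solution using bash and python (I'm a newbie), but it feels over-engineered. The question is: is there a simpler or more elegant way to do this that I have missed? Solutions can be in python, bash or R. awk is welcome too, since I am trying to learn it. Being a relative beginner does tend to make one complicate things.

Here is my solution:

A wrapper that puts it all together and gives an idea of the workflow: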

#! /bin/bash 

# select the columns of interest (fq_file and Sample_name) and write them to a file
# ('>' rather than '>>' so that re-running does not append duplicate rows)
tail -n +2 samples.txt | cut -d$'\t' -f1,3 > BAMfilames.txt 

# call my little python script that creates new .sh files with the renaming commands 
./renameBamFiles.py 

# finally do the renaming 
./renameBam.sh 

# and the folders too 
./renameBamFolder.sh 

renameBamFiles.py:

#! /usr/bin/env python 
import re 

# Read in the data sample file and create a bash file that will rename the tophat output 
# the renaming will be as follows: 
# mv L2377_Track-3893_R1_ L2377_Track-3893_R1_SRSF7_cyto_B 
# 

# Set the input file name 
# (The program must be run from within the directory 
# that contains this data file) 
InFileName = 'BAMfilames.txt' 


### Rename BAM files 

# Open the input file for reading 
InFile = open(InFileName, 'r') 


# Open the output file for writing 
OutFileName= 'renameBam.sh' 

OutFile=open(OutFileName,'w') # 'w' overwrites on each run; you can append instead with 'a' 

OutFile.write("#! /bin/bash"+"\n") 
OutFile.write(" "+"\n") 


# Loop through each line in the file 
for Line in InFile: 
    ## Remove the line ending characters 
    Line=Line.strip('\n') 

    ## Separate the line into a list of its tab-delimited components 
    ElementList=Line.split('\t') 

    # separate the folder string from the experimental name 
    fileroot=ElementList[1] 
    fileroot=fileroot.split() 

    # create variable names using regex 
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0]) 
    folderName=folderName.strip('\n') 
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0]) 

    command= "for file in %s/accepted_hits.*; do mv $file ${file/accepted_hits/%s}; done" % (folderName, fileName) 

    print(command) 
    OutFile.write(command+"\n") 


# After the loop is completed, close the files 
InFile.close() 
OutFile.close() 


### Rename folders 

# Open the input file for reading 
InFile = open(InFileName, 'r') 


# Open the output file for writing 
OutFileName= 'renameBamFolder.sh' 

OutFile=open(OutFileName,'w') 

OutFile.write("#! /bin/bash"+"\n") 
OutFile.write(" "+"\n") 


# Loop through each line in the file 
for Line in InFile: 
    ## Remove the line ending characters 
    Line=Line.strip('\n') 

    ## Separate the line into a list of its tab-delimited components 
    ElementList=Line.split('\t') 

    # separate the folder string from the experimental name 
    fileroot=ElementList[1] 
    fileroot=fileroot.split() 

    # create variable names using regex 
    folderName=re.sub(r'^(.*)(\_)(\w+).*', r'\1\2\3\2', ElementList[0]) 
    folderName=folderName.strip('\n') 
    fileName = "%s_%s_%s" % (fileroot[1], fileroot[2], fileroot[0]) 

    command= "mv %s %s" % (folderName, fileName) 

    print(command) 

    OutFile.write(command+"\n") 


# After the loop is completed, close the files 
InFile.close() 
OutFile.close() 

renameBam.sh, created by the Python script above:

#! /bin/bash 

for file in L2369_Track-3885_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/7_t_B}; done 
for file in L2349_Track-3865_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/3_t_A}; done 
for file in L2354_Track-3870_R1_/accepted_hits.*; do mv $file ${file/accepted_hits/GFP_c_A}; done 
(..) 

renameBamFolder.sh is very similar:

mv L2369_Track-3885_R1_ 7_t_B 
mv L2349_Track-3865_R1_ 3_t_A 
mv L2354_Track-3870_R1_ GFP_c_A 
mv L2377_Track-3893_R1_ 7_c_B 

Since I am learning, I feel that a few examples of different ways of doing this, and of how to think about it, would be very useful.

+2

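Using Python to generate bash seems a bit pointless. I would say pick one language or the other and stick with it. Python is probably the less arcane of the two if you are not used to either. – 2013-02-20 13:11:30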

Answers

2

A simple approach:

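find . -type d -print | 
while IFS= read -r oldPath; do 

    parent=$(dirname "$oldPath") 
    old=$(basename "$oldPath") 
    # index() is a literal prefix match, so "." and "-" in a directory name
    # are not treated as regex syntax (and the "." entry from find matches nothing)
    new=$(awk -v old="$old" 'index($1, old) == 1 {print $4"_"$5"_"$3; exit}' samples.txt) 

    if [ -n "$new" ]; then 
        newPath="${parent}/${new}" 
        echo mv "$oldPath" "$newPath" 
        echo mv "${newPath}/accepted_hits.bam" "${newPath}/${new}.bam" 
    fi 
done 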

After initial testing, remove the "echo"s to make it actually perform the "mv"s.

If all of your target directories are at one level, as @tripleee's answer implies, then it is even simpler. Just cd to their parent directory and do:

awk 'NR>1{sub(/[^_]+$/,"",$1); print $1" "$4"_"$5"_"$3}' samples.txt | 
while read -r old new; do 
    echo mv "$old" "$new" 
    echo mv "${new}/accepted_hits.bam" "${new}/${new}.bam" 
done 

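In one of your expected outputs you renamed the ".bai" file and in the other you did not, and you do not say whether you want that or not. If you do want to rename it, just add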

echo mv "${new}/accepted_hits.bam.bai" "${new}/${new}.bam.bai" 

to whichever of the solutions above you prefer.
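For readers less familiar with awk, here is a minimal Python sketch of what the two awk expressions in the one-level solution compute. The function names `dir_prefix` and `new_name` are made up for illustration, and the row is assumed to be split on any whitespace, as awk does by default.

```python
import re


def dir_prefix(fq_file):
    """Drop the trailing run of non-underscore characters,
    like awk's sub(/[^_]+$/, "", $1)."""
    return re.sub(r"[^_]+$", "", fq_file)


def new_name(fields):
    """Mirror awk's print $4"_"$5"_"$3 (awk fields are 1-based,
    Python list indices are 0-based)."""
    return "_".join((fields[3], fields[4], fields[2]))


# One data row from samples.txt, split on whitespace as awk would.
row = "L2369_Track-3885_R1.fastq.gz S1746_B_7_t B 7 t L2369_B_7_t 163 6".split()
print(dir_prefix(row[0]), new_name(row))
```

The first value is the directory name to look for and the second is the name it would be renamed to.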

+0

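The awk solution is by far the most elegant, IMO. Even though I have never learned awk, I intuitively managed to adapt your script to rename another, similar set of files. The only thing you might consider changing in your solution, @EdMorton, is the field order: print $1" "$4"_"$5"_"$3 should be print $1" "$3"_"$4"_"$2. Thanks a lot. – fridaymeetssunday 2013-02-22 14:46:26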

0

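Of course, you can do it all in Python alone, and it can yield a small, readable script.

First thing: read the samples.txt file and create a mapping from the existing file prefixes to the desired prefixes. The file is not formatted for Python's csv reader module, because the column separator is also used inside the data of the last column.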

mapping = {} 
with open("samples.txt") as samples: 
    # throw away the header line 
    samples.readline() 
    for line in samples: 
        # separate the columns by splitting on whitespace 
        # (either space sequences or tabs) 
        fields = line.split() 
        # skip blank or malformed lines 
        if len(fields) < 6: 
            continue 
        # splitting on whitespace breaks a value such as "B 7 t" into three fields, 
        # so the three names after sample_id receive its components 
        fq_file, sample_id, Sample_name, Library_ID, FC_Number, track_lanes_pos, *other = fields 
        # the [:-2] part throws away the "R1" suffix, as in the example above 
        file_prefix = fq_file.split(".")[0][:-2] 
        target_id = "_".join((Library_ID, FC_Number, Sample_name)) 
        mapping[file_prefix] = target_id 
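If the file turns out to be strictly tab-delimited, with spaces only inside the Sample_name value (as comments elsewhere on this page state), the csv module may work after all. The following is only a sketch under that assumption: `build_mapping` is a hypothetical helper, and the in-memory sample is a reconstruction of the listing in the question with tabs between columns.

```python
import csv
import io


def build_mapping(handle):
    """Map directory prefix -> new name from a tab-delimited samples table."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # "L2369_Track-3885_R1.fastq.gz" -> "L2369_Track-3885_"
        prefix = row["fq_file"].split(".")[0][:-2]
        # "B 7 t" -> "7_t_B"; raises ValueError if there are not three parts
        first, second, third = row["Sample_name"].split()
        mapping[prefix] = "_".join((second, third, first))
    return mapping


# Assumed layout: one tab between columns, spaces only inside Sample_name.
sample = (
    "fq_file\tSample_ID\tSample_name\tLibrary_ID\tFC_Number\tTrack_Lanes_Pos\n"
    "L2369_Track-3885_R1.fastq.gz\tS1746_B_7_t\tB 7 t\tL2369_B_7_t\t163\t6\n"
    "L2354_Track-3870_R1.fastq.gz\tS1731_A_GFP_c\tA GFP c\tL2354_A_GFP_c\t163\t5\n"
)
print(build_mapping(io.StringIO(sample)))
```

With a real file, `build_mapping(open("samples.txt", newline=""))` would be the equivalent call. Reading the columns by header name avoids depending on how many spaces a value happens to contain.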

Then check the directory names and, inside each matching directory, rename the ".bam" files according to the mapping.

import os 
for entry in os.listdir("."): 
    if entry in mapping: 
        dir_prefix = "./" + entry + "/" 
        for file_entry in os.listdir(dir_prefix): 
            if ".bam" in file_entry: 
                # replace everything before ".bam" with the new name, 
                # keeping suffixes such as ".bam.bai" 
                parts = file_entry.split(".bam") 
                parts[0] = mapping[entry] 
                new_name = ".bam".join(parts) 

                os.rename(dir_prefix + file_entry, dir_prefix + new_name) 
        # rename the directory last, once its files are done 
        os.rename(entry, mapping[entry]) 
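Since the shell answers on this page print their `mv` commands with `echo` before doing anything, the same dry-run idea could be sketched in Python by building the list of renames first and applying it only after inspection. `plan_renames` and `apply_plan` are hypothetical names, and the demo runs on a throwaway temporary directory with empty placeholder files.

```python
import os
import tempfile


def plan_renames(root, mapping):
    """Return (old, new) path pairs without touching the disk."""
    plan = []
    for entry in sorted(os.listdir(root)):
        old_dir = os.path.join(root, entry)
        if entry not in mapping or not os.path.isdir(old_dir):
            continue
        new = mapping[entry]
        for name in sorted(os.listdir(old_dir)):
            if name.startswith("accepted_hits.bam"):
                renamed = name.replace("accepted_hits", new, 1)
                plan.append((os.path.join(old_dir, name),
                             os.path.join(old_dir, renamed)))
        # The directory goes last so the file paths above stay valid.
        plan.append((old_dir, os.path.join(root, new)))
    return plan


def apply_plan(plan):
    """Perform the renames in the order they were planned."""
    for old, new in plan:
        os.rename(old, new)


# Demo on a temporary tree; the files are empty placeholders.
with tempfile.TemporaryDirectory() as root:
    old_dir = os.path.join(root, "L2369_Track-3885_")
    os.mkdir(old_dir)
    for name in ("accepted_hits.bam", "accepted_hits.bam.bai", "deletions.bed"):
        open(os.path.join(old_dir, name), "w").close()

    plan = plan_renames(root, {"L2369_Track-3885_": "7_t_B"})
    for old, new in plan:
        print("mv", os.path.relpath(old, root), os.path.relpath(new, root))
    apply_plan(plan)
    print(sorted(os.listdir(os.path.join(root, "7_t_B"))))
```

Printing the plan plays the role of the `echo mv` lines; `apply_plan` is the step you would run only once the output looks right.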
0

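It seems simple enough to just read the required fields from the index file in a simple while loop. The structure of the file is not obvious, so I am assuming it is whitespace-delimited and that Sample_Id is actually four fields (the complex sample_id, followed by the three components of the name). Perhaps you have a tab-delimited file with internal spaces in the Sample_Id field? In any case, this should be easy to adapt if my assumption is wrong.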

# Skip the annoying field names 
tail -n +2 samples.txt | 
while read -r fq _ c a b chaff; do 
    dir=${fq%R1.fastq.gz} 
    new="${a}_${b}_$c" 
    echo mv "$dir"/accepted_hits.bam "$dir/$new".bam 
    echo mv "$dir"/accepted_hits.bam.bai "$dir/$new".bam.bai 
    echo mv "$dir" "$new" 
done 

Take out the echos if the output looks like what you want.

+0

The file is tab-delimited, and some fields contain spaces. That is why the OP uses `cut -d$'\t'` in the script. If you click "edit" on the question, you will see the tabs. – dogbane 2013-02-20 14:13:57

+0

My apologies @tripleee, the fields are tab-delimited. I used the field Sample_name, but it does not have to be that particular field. – fridaymeetssunday 2013-02-20 14:17:14

0

Here is one way using a shell script. Run it like:

script.sh /path/to/samples.txt /path/to/data 

Contents of script.sh:

#!/bin/bash 

# add directory names to an array 
while IFS= read -r -d '' dir; do 

    dirs+=("$dir") 

done < <(find "$2"/* -type d -print0) 


# process the sample list 
while IFS=$'\t' read -r -a list; do 

    for i in "${dirs[@]}"; do 

        # if the directory is in the sample list 
        if [ "${i##*/}" == "${list[0]%R1.fastq.gz}" ]; then 

            # expects list[2..4] to hold the three name components; 
            # adjust the indices if Sample_name is a single tab-delimited field 
            tag="${list[3]}_${list[4]}_${list[2]}" 
            new="${i%/*}/$tag" 

            # only change name if there's a bam file 
            if [ -f "$i/accepted_hits.bam" ]; then 

                mv "$i" "$new" 
                mv "$new/accepted_hits.bam" "$new/$tag.bam" 
            fi 
        fi 
    done 

done < <(tail -n +2 "$1") 
0

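This is not exactly what you are looking for (just thinking outside the box), but you might consider an alternative "view" of your file system, using the term "view" in the sense that a database view relates to a table. You can do this with a "file system in userspace" via FUSE. There are many existing utilities for this, but I do not know of one that handles any arbitrary set of files specifically for renaming or reorganizing. As a concrete example of how it could be used, pytagsfs creates a virtual (fuse) file system based on rules you define, so that the directory structure of your files appears however you want. (Maybe that would work for you too, although pytagsfs is really meant for media files.) You then simply operate on that (virtual) file system with whatever programs normally access the data. Or, to make the virtual directory structure permanent (if pytagsfs has no option for this already), just copy the virtual file system to another directory outside the virtual file system.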
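As a much lighter stand-in for the same "view" idea, and not FUSE or pytagsfs, a directory of symbolic links with meaningful names could point at the untouched originals. This is only a sketch: `build_symlink_view` is a hypothetical helper, it assumes a Unix-like system where creating symlinks is permitted, and it gives new names to the directories only, not to the files inside them.

```python
import os
import tempfile


def build_symlink_view(data_root, view_root, mapping):
    """Create view_root/<new name> links pointing at data_root/<old name>."""
    os.makedirs(view_root, exist_ok=True)
    created = []
    for old, new in sorted(mapping.items()):
        target = os.path.abspath(os.path.join(data_root, old))
        link = os.path.join(view_root, new)
        # Skip entries whose directory is missing or whose link already exists.
        if os.path.isdir(target) and not os.path.lexists(link):
            os.symlink(target, link, target_is_directory=True)
            created.append(link)
    return created


# Demo on a temporary tree with an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "data")
    view = os.path.join(tmp, "view")
    os.makedirs(os.path.join(data, "L2369_Track-3885_"))
    open(os.path.join(data, "L2369_Track-3885_", "accepted_hits.bam"), "w").close()

    build_symlink_view(data, view, {"L2369_Track-3885_": "7_t_B"})
    print(os.listdir(os.path.join(view, "7_t_B")))
```

The original tree keeps its provider-assigned names, so anything that still refers to them continues to work, and removing the view directory undoes everything.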
