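Searching for IDs in a pool of txt files

I have already asked about searching for one query at a time, but this is a one-to-many file search problem. I have a query.txt file containing a thousand IDs.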

query.txt 
GABDI004191 
GABDI007217 
GABDI004196 
GABDI008080 
.....

I also have a set of files (file1.table, file2.table, up to file120.table) that contain search hits of individual IDs against other sequences.

file1.table 
GABDI004191 c23504_g1_i1 29.38 160 100 2 1 160 90 530 
GABDI004191 c20415_g1_i1 45.21 73 39 1 180 252 27 242 
GABDI004191 c17483_g1_i1 88.78 98 11 0 20 117 1 294 
GABDI008080 c1407_g1_i1 95.56 45 2 0 112 156 200 66 9e-25 
GABDI004196 c2892_g1_i1 35.44 79 50 1 37 115 237 4 7e-08 

file2.table 
GABDI007217 TR9707|c0_g1_i1 32.47 77 49 2 1 77 309 88 
GABDI004196 TR9163|c0_g1_i1 63.77 69 25 0 315 383 207 1 
GABDI007217 TR1165|c0_g1_i1 91.56 154 12 1 1 153 464 3 
GABDI004191 TR4933|c0_g1_i1 91.56 154 12 1 1 153 35 496 
GABDI008080 TR16029|c0_g1_i1 32.20 118 77 2 37 152 242

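For each ID, I need to extract the lines in which it occurs in every .table file and store them in a separate file named after that ID. For example:

For ID GABDI008080 there would be an output file GABDI008080.txt containing the following: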

GABDI008080 c1407_g1_i1 95.56 45 2 0 112 156 200 66 9e-25 
GABDI008080 TR16029|c0_g1_i1 32.20 118 77 2 37 152 242

And for ID GABDI004191 there would be an output file GABDI004191.txt containing the following:

GABDI004191 c23504_g1_i1 29.38 160 100 2 1 160 90 530 
GABDI004191 c20415_g1_i1 45.21 73 39 1 180 252 27 242 
GABDI004191 c17483_g1_i1 88.78 98 11 0 20 117 1 294 
GABDI004191 TR4933|c0_g1_i1 91.56 154 12 1 1 153 35 496

I have only just started learning Python and Bash scripting. I tried the Python code below, but I am stuck.

#!/bin/python
import glob
with open('query.txt', 'r') as query_file:  # reading in IDs from query file
    for id in query_file:
        for file in glob.glob("*.table"):
            with open(file, 'r') as one_file:  # opening individual files for reading
                for line in one_file:
                    if id in line:  # trying to find IDs from each line in those files
                        idname = open(id + '.txt', 'w')  # opening a file with the ID name where all found results for that ID is stored
                        idname.append(line)
                        idnam.close()

I would appreciate any help, whether it uses Awk, any shell script, or Python. Thanks.

2016-02-01 Uzo

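You don't mention how many lines your 120 table files have. Also, what fraction of all the distinct identifiers do the thousand identifiers in query.txt represent? Are the table files sorted by the first column? Could they be? The query file is small enough that it can be sorted and made unique. – tomc

With bash, you can do something like this: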

while IFS= read -r i; do
    for f in file*.table; do
        grep "^$i " "$f" >> "${i}.txt"
    done
done < query.txt

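Or even better, since you don't need to know which file each line came from (the -h option stops grep from prefixing every match with the file name when it is given several files):

while IFS= read -r i; do
    grep -h "^$i " file*.table >> "${i}.txt"
done < query.txt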
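
For anyone going the Python route, here is a minimal sketch of what the "^$i " pattern buys you: the match is anchored to the first column. It also shows why the asker's "if id in line" test never fires, because each line read from query.txt still carries its newline. The helper name matches_id is made up for this illustration.

```python
def matches_id(line, wanted):
    """True when the first whitespace-separated field of line equals wanted."""
    fields = line.split(None, 1)
    return bool(fields) and fields[0] == wanted


hit = "GABDI004191 c23504_g1_i1 29.38 160 100 2 1 160 90 530\n"
raw_id = "GABDI004191\n"  # what iterating over query.txt yields

print(raw_id in hit)                         # False: the newline is part of the needle
print(raw_id.strip() in hit)                 # True, but a substring test can match any column
print(matches_id(hit, raw_id.strip()))       # True: compared against the first field only
print(matches_id("X" + hit, "GABDI004191"))  # False: a longer ID does not match
```

Stripping the ID and comparing whole fields plays the same role as the leading ^ and the trailing space in the grep pattern.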

2016-02-01 09:16:09

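@EdMorton I was thinking of copy/paste, but you are right. Changed the variable names for the sake of clarity. Tks

Instead of the 'for' loop, you can use 'grep "^$i" file*.table'

@AndreaCorbellini Sure, edited, thanks =)

I think this should work for you:

Edit: corrected the code, since it wasn't working. It is fully functional now.

Explanation: first, I load the whole list of codes into an internal array; then, for every line of every file, if its code is in the list, I print that line to a file named after the code.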

cat: awk: No such file or directory 
$ ./awk.sh 
GABDI004191 
GABDI007217 
GABDI004196 
GABDI008080 
$ cat query.txt 
GABDI004191 
GABDI007217 
GABDI004196 
GABDI008080 
$ cat file1.table 
GABDI004191 c23504_g1_i1 29.38 160 100 2 1 160 90 530 
GABDI004191 c20415_g1_i1 45.21 73 39 1 180 252 27 242 
GABDI004191 c17483_g1_i1 88.78 98 11 0 20 117 1 294 
GABDI008080 c1407_g1_i1 95.56 45 2 0 112 156 200 66 9e-25 
GABDI004196 c2892_g1_i1 35.44 79 50 1 37 115 237 4 7e-08 
$ cat file2.table 
GABDI007217 TR9707|c0_g1_i1 32.47 77 49 2 1 77 309 88 
GABDI004196 TR9163|c0_g1_i1 63.77 69 25 0 315 383 207 1 
GABDI007217 TR1165|c0_g1_i1 91.56 154 12 1 1 153 464 3 
GABDI004191 TR4933|c0_g1_i1 91.56 154 12 1 1 153 35 496 
GABDI008080 TR16029|c0_g1_i1 32.20 118 77 2 37 152 242 
$ cat awk.sh 
awk 'BEGIN{ 
       while ((getline line < "query.txt") > 0) 
       {codeList[line]=line 
       print codeList[line] 
       } 
       close("query.txt") 
     } 
     $1 in codeList { print $0 > ($1".txt") }
' file*.table 
$ ./awk.sh 
GABDI004191 
GABDI007217 
GABDI004196 
GABDI008080 
$ ls *txt 
GABDI004191.txt GABDI004196.txt GABDI008080.txt query.txt 
$ cat GABDI004191.txt 
GABDI004191 c23504_g1_i1 29.38 160 100 2 1 160 90 530 
GABDI004191 c20415_g1_i1 45.21 73 39 1 180 252 27 242 
GABDI004191 c17483_g1_i1 88.78 98 11 0 20 117 1 294 
$
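
Since the question also asked about Python, the same load-then-filter idea can be sketched roughly as below. This is only an illustration under a few assumptions: the ID is the first whitespace-separated field of every table line, all the hits fit in memory, and the function names (load_ids, collect_hits, write_hits, main) are invented for this sketch.

```python
import glob
import os
from collections import defaultdict


def load_ids(query_path):
    """Return the set of non-empty, whitespace-stripped IDs in the query file."""
    with open(query_path) as query_file:
        return {line.strip() for line in query_file if line.strip()}


def collect_hits(ids, table_paths):
    """Group table lines by their first field, keeping only the wanted IDs."""
    hits = defaultdict(list)
    for path in table_paths:
        with open(path) as table:
            for line in table:
                fields = line.split(None, 1)
                if fields and fields[0] in ids:
                    hits[fields[0]].append(line.rstrip("\n"))
    return hits


def write_hits(hits, out_dir="."):
    """Write one <ID>.txt per ID that had at least one hit."""
    for seq_id, lines in hits.items():
        with open(os.path.join(out_dir, seq_id + ".txt"), "w") as out:
            out.write("\n".join(lines) + "\n")


def main():
    # Assumes query.txt and the *.table files sit in the current directory.
    ids = load_ids("query.txt")
    write_hits(collect_hits(ids, sorted(glob.glob("*.table"))))
```

Calling main() from the directory holding the files reads each table once, instead of once per ID, and IDs without any hit get no output file. Note that sorted() orders the names lexicographically (file10.table before file2.table), which only affects the order of lines inside each output file.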

Hope it helps.

2016-02-01 09:24:18

awk 'NR==FNR{ids[$0];next} $1 in ids{print > ($1".txt")}' query.txt *.table

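If you get an error message about too many files being open at the same time, then get and use GNU awk, since it handles that for you internally. If that proves impossible, add close($1".txt") after the print... (and change > to >>, because a file that is reopened with > after close() gets truncated).

The shell is an environment from which to manipulate (e.g. create/move/destroy) files and processes and to sequence calls to tools. The standard UNIX tool for manipulating text is awk, so whenever you need to manipulate text in UNIX you should write an awk script and call it from the shell. Read Effective Awk Programming, 4th Edition, by Arnold Robbins.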
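
The same open-file limit bites in Python if you keep one handle per ID. A rough sketch of the close-after-every-write idea follows; split_by_id and out_dir are names made up for this example, and it assumes the ID is the first whitespace-separated field of each line.

```python
import os


def split_by_id(ids, table_paths, out_dir="."):
    """Stream the tables, writing each matching line to <ID>.txt.

    Only one output file is open at any moment, so the number of distinct
    IDs can exceed the per-process limit on open files.
    """
    seen = set()
    for path in table_paths:
        with open(path) as table:
            for line in table:
                fields = line.split(None, 1)
                if not fields or fields[0] not in ids:
                    continue
                seq_id = fields[0]
                # Truncate on the first hit of this run, append afterwards,
                # so that re-running the script does not duplicate lines.
                mode = "a" if seq_id in seen else "w"
                seen.add(seq_id)
                with open(os.path.join(out_dir, seq_id + ".txt"), mode) as out:
                    out.write(line if line.endswith("\n") else line + "\n")
    return seen
```

Reopening a file for every matching line is slower than keeping handles open, so this trade-off is only worth it when the number of IDs really does exceed the limit.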

2016-02-01 13:00:03

Thanks a lot, Ed. I found that book on awk and I have already learned a great deal from it. – Uzo
