Using grep and a pattern file to count individual pattern matches in a file

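I am using grep with a file that contains several search patterns. As output, I would like to get each matched pattern together with the number of times that particular pattern occurs.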

cat pattern.txt 

AT3G09260.1 
AT5G50920.1

The input file looks like this:

>AT2G44750.1 | Symbols: TPK2 | thiamin pyrophosphokinase 2 | chr2:18451510-18452754 FORWARD LENGTH=265 
>AT2G47140.1 | Symbols: | NAD(P)-binding Rossmann-fold superfamily protein | chr2:19350970-19352059 REVERSE LENGTH=257 
>AT2G47120.1 | Symbols: | NAD(P)-binding Rossmann-fold superfamily protein 
>AT1G21470.1 | Symbols: | BEST Arabidopsis thaliana protein match is: CLPC homologue 1 (TAIR:AT5G50920.1); Has 326 Blast hits to 324 proteins in 95 species: Archae - 0; Bacteria - 130; Metazoa - 0; Fungi - 0; Plants - 67; Viruses - 0; Other Eukaryotes - 129 (source: NCBI BLink). | chr1:7516709-7517179 REVERSE LENGTH=118 
>AT3G09260.1 | Symbols: PYK10, PSR3.1, BGLU23, LEB | Glycosyl hydrolase superfamily protein | chr3:2840657-2843730 REVERSE LENGTH=524 
>AT5G48175.1 | Symbols: | FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: endomembrane system; EXPRESSED IN: hypocotyl, male gametophyte, root; BEST Arabidopsis thaliana protein match is: Glycosyl hydrolase superfamily protein (TAIR:AT3G09260.1); Has 30201 Blast hits to 17322 proteins in 780 species: Archae - 12; Bacteria - 1396; Metazoa - 17338; Fungi - 3422; Plants - 5037; Viruses - 0; Other Eukaryotes - 2996 (source: NCBI BLink). | chr5:19539208-19539676 FORWARD LENGTH=115 
>AT5G50920.1 | Symbols: CLPC, ATHSP93-V, HSP93-V, DCA1, CLPC1 | CLPC homologue 1 | chr5:20715710-20719800 REVERSE LENGTH=929

I would like to get something like:

AT3G09260.1 2 
AT5G50920.1 2

I have tried:

grep -f pattern.txt -c inputfile.txt 
4

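but this only gives me the total number of matching lines (for all patterns combined). I believe this question has already been asked here, but it was never resolved: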

how to loop over pattern from a file with grep

Thanks.

Source

2017-10-20 marie

Why do you write *but it was never resolved*? That question has been answered. – RomanPerekhrest

The awk script provided there did not give the desired output. – marie

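You basically need grep -o, which prints only the matched part of each line, one match per output line. You can then use sort and uniq -c to get the count for each pattern, like this: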

$ grep -of pattern_file input_file | sort | uniq -c 
     2 AT3G09260.1 
     2 AT5G50920.1

If you want the two columns swapped, you can use awk like this:

$ grep -of pattern_file input_file | sort | uniq -c | awk '{print $2,$1}' 
AT3G09260.1 2 
AT5G50920.1 2
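
For reference, here is a self-contained sketch of the pipeline above, run on a shortened, made-up sample of the input (the temporary directory and the file contents are assumptions for illustration, not the real data). It also adds -F, which makes grep treat each pattern as a fixed string, so the `.` in `AT3G09260.1` cannot match an arbitrary character.

```shell
#!/bin/sh
# Work in a throwaway directory; the file names mirror the question.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' > pattern.txt

# Shortened stand-in for the real FASTA headers (illustrative data only).
cat > inputfile.txt <<'EOF'
>AT1G21470.1 | match is: CLPC homologue 1 (TAIR:AT5G50920.1)
>AT3G09260.1 | Glycosyl hydrolase superfamily protein
>AT5G48175.1 | match is: Glycosyl hydrolase (TAIR:AT3G09260.1)
>AT5G50920.1 | CLPC homologue 1
EOF

# -o prints each match on its own line, -F disables regex interpretation,
# -f reads the patterns from pattern.txt; awk swaps count and pattern.
counts=$(grep -oFf pattern.txt inputfile.txt | sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$counts"
```

On this sample each pattern occurs twice, once as a header ID and once inside a TAIR reference, so both are reported with a count of 2.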

Or simply use awk on its own:

$ awk 'FNR==NR{a[$1]=0; next} { for(i in a) {a[i]+=gsub(i,"")} } END{for(i in a){ print i, a[i]} }' pattern_file RS= input_file 
AT5G50920.1 2 
AT3G09260.1 2
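
The gsub call above interprets each pattern as a regular expression, and `for (i in a)` visits the patterns in an unspecified order. A minimal sketch of a variant that counts literal occurrences with index() and prints the results in pattern-file order might look like this; the array names `order` and `count` and the sample data are assumptions for illustration.

```shell
#!/bin/sh
# Throwaway directory with a shortened, made-up sample (illustrative only).
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' > pattern_file
cat > input_file <<'EOF'
>AT1G21470.1 | match is: CLPC homologue 1 (TAIR:AT5G50920.1)
>AT3G09260.1 | Glycosyl hydrolase superfamily protein
>AT5G48175.1 | match is: Glycosyl hydrolase (TAIR:AT3G09260.1)
>AT5G50920.1 | CLPC homologue 1
EOF

awk_counts=$(awk '
    # First file: remember each new pattern and the order it appeared in.
    FNR == NR {
        if ($1 != "" && !($1 in count)) { order[++n] = $1; count[$1] = 0 }
        next
    }
    # Second file: count literal (non-regex) occurrences on every line.
    {
        for (k = 1; k <= n; k++) {
            p = order[k]
            rest = $0
            while ((pos = index(rest, p)) > 0) {
                count[p]++
                rest = substr(rest, pos + length(p))
            }
        }
    }
    # Report in the same order as the pattern file.
    END { for (k = 1; k <= n; k++) print order[k], count[order[k]] }
' pattern_file input_file)
printf '%s\n' "$awk_counts"
```

Because the input is read line by line here, the `RS=` paragraph-mode assignment from the one-liner is not needed.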

Source

2017-10-20 11:01:42 batMan

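Great, both the grep and the awk solutions work perfectly. Thanks. – marie

The following awk may help you. Your sample Input_file does not appear to contain any repeated lines, so I could not test it against your expected output.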

awk '{a[$0]++} END{for(i in a){print i,a[i]}}' Input_file

Source

2017-10-20 10:21:50 RavinderSingh13

Try:

grep -f pattern.txt inputfile.txt| cut -d'|' -f1 |sort | uniq -c

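This greps the matching lines from your file, extracts the ID (everything before the first pipe symbol), sorts the IDs, and then counts each unique occurrence.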
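
As a rough illustration of what this pipeline produces, here is a sketch on a shortened, made-up sample (the file contents are assumptions for illustration). Because cut keeps the ID at the start of each matching line, the counts are per header line, not per pattern from pattern.txt.

```shell
#!/bin/sh
# Throwaway directory with a shortened, made-up sample (illustrative only).
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' 'AT3G09260.1' 'AT5G50920.1' > pattern.txt
cat > inputfile.txt <<'EOF'
>AT1G21470.1 | match is: CLPC homologue 1 (TAIR:AT5G50920.1)
>AT3G09260.1 | Glycosyl hydrolase superfamily protein
>AT5G48175.1 | match is: Glycosyl hydrolase (TAIR:AT3G09260.1)
>AT5G50920.1 | CLPC homologue 1
EOF

# Keep the text before the first "|" of every matching line, then count
# how often each of those leading IDs occurs; awk tidies the columns.
ids=$(grep -f pattern.txt inputfile.txt | cut -d'|' -f1 | sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$ids"
```

On this sample all four header lines match, and each leading ID is reported once, including `>AT1G21470.1` and `>AT5G48175.1`, which are not in pattern.txt.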

Source

2017-10-20 10:26:34 user1717259

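grep returns whole lines, which cannot easily be sorted, so this does not work. I will edit my question and add details of the input file to make it clearer. – marie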
