Matching text with grep or awk

I'm having trouble with grep and awk. I think it's because my input file contains text that looks like code.

My input file contains ID names that look like this:

SNORD115-40 
MIR432 
RNU6-2

The reference file looks like this:

Ensembl Gene ID HGNC symbol 
ENSG00000199537 SNORD115-40 
ENSG00000207793 MIR432 
ENSG00000266661 
ENSG00000243133 
ENSG00000207447 RNU6-2

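I want to match the ID names from my source file against my reference file and print the corresponding ID numbers, so that the output file looks like this: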

ENSG00000199537 SNORD115-40 
ENSG00000207793 MIR432 
ENSG00000207447 RNU6-2

I have tried this loop:

exec < source.file 
while read line 
do 
grep -w $line reference.file > outputfile 
done

I have also tried awk

awk 'NF == 2 {print $0}' reference.file 
awk 'NF >2 {print $0}' reference.file

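to play around with the reference file, but I only ever get one of the grep'd IDs.

Any suggestions, or an easier way of doing this, would be great.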


2013-05-09 user1879573

$ fgrep -f source.file reference.file 
ENSG00000199537 SNORD115-40 
ENSG00000207793 MIR432 
ENSG00000207447 RNU6-2

fgrep is equivalent to grep -F:

-F, --fixed-strings 
      Interpret PATTERN as a list of fixed strings, separated by 
      newlines, any of which is to be matched. (-F is specified by 
      POSIX.)

The -f option takes PATTERN from a file:

-f FILE, --file=FILE 
      Obtain patterns from FILE, one per line. The empty file 
      contains zero patterns, and therefore matches nothing. (-f is 
      specified by POSIX.)

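As noted in the comments, this can produce false positives if an ID in reference.file contains an ID from source.file as a substring. You can build a more explicit pattern for grep on the fly with sed: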

grep -f <(sed 's/.*/ &$/' input.file) reference.file

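But this way the patterns are interpreted as regular expressions rather than fixed strings, which is potentially fragile (though it may be OK if the IDs contain only alphanumeric characters). A better way, though (thanks to @sidharthcnadhan), is to use the -w option: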

-w, --word-regexp 
      Select only those lines containing matches that form whole 
      words. The test is that the matching substring must either be 
      at the beginning of the line, or preceded by a non-word 
      constituent character. Similarly, it must be either at the end 
      of the line or followed by a non-word constituent character. 
      Word-constituent characters are letters, digits, and the 
      underscore.

So the final answer to your question is:

grep -Fwf source.file reference.file
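
To see what -w changes, here is a minimal sketch. The file names follow the question; the extra reference row for SNORD115-401, including its Ensembl ID, is made up and is there only to trigger the substring match.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file

# The last row (ENSG00000000001 SNORD115-401) is hypothetical, not from the question.
cat > reference.file <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
ENSG00000000001 SNORD115-401
EOF

# Fixed strings only: SNORD115-40 also matches inside SNORD115-401.
loose=$(grep -Ff source.file reference.file)

# Fixed strings plus whole-word matching: the hypothetical row is dropped.
strict=$(grep -Fwf source.file reference.file)

printf 'without -w:\n%s\n\nwith -w:\n%s\n' "$loose" "$strict"
```

Note that -w still treats '-' as a non-word character, so an ID such as SNORD115 in source.file would match the SNORD115-40 row. If that matters for your data, comparing whole fields (as in the awk answer) is the safer route.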


2013-05-09 09:09:36

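This will produce false positives, i.e. 'SNORD115-40' in the input file will also match 'SNORD115-401' in the reference, etc. – 2013-05-09 09:15:00

@sudo_O Good point, thanks – 2013-05-09 09:22:13

We can use "fgrep -wf source.file reference.file" to avoid the false positives. – 2013-05-09 09:30:03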

This will do the trick:

$ awk 'NR==FNR{a[$0];next}$NF in a{print}' input reference 
ENSG00000199537 SNORD115-40 
ENSG00000207793 MIR432 
ENSG00000207447 RNU6-2
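
For readers unfamiliar with the NR==FNR idiom, the one-liner can be spelled out roughly as below. This is a sketch that assumes the ID is always the last whitespace-separated field of a reference line. It keys on $1 instead of $0 and skips blank input lines, which are small deviations from the command above meant to tolerate trailing whitespace; the array name ids is arbitrary. The file names input and reference follow the command above, and the sample data is copied from the question.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > input
cat > reference <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
EOF

matches=$(awk '
    # First file only: NR (global line count) equals FNR (per-file count).
    # Remember each ID as an array key, then move on to the next line.
    NR == FNR { if (NF) ids[$1]; next }

    # Second file: print the line if its last field is a remembered ID.
    $NF in ids
' input reference)

printf '%s\n' "$matches"
```

Because the comparison is on the whole last field, SNORD115-40 cannot match SNORD115-401, and lines without a symbol are skipped since their last field is the Ensembl ID.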


2013-05-09 09:07:32

That was a nice bash-ish attempt. The problem is that you always overwrite the result file. Use ">>" instead of ">", or move the ">" after done:

grep -w $line reference.file >> outputfile

or

done > outputfile
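
Put together, the loop with the redirection moved after done might look like the sketch below. The IFS= read -r, the quoting of "$line", the blank-line check, and the -F flag are additions that are not in the question's code; they keep each ID literal and intact. The sample data is copied from the question.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file
cat > reference.file <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
EOF

# Redirect once, after done, so every match lands in the same file.
while IFS= read -r line; do
    [ -n "$line" ] || continue          # skip blank lines in source.file
    grep -Fw -- "$line" reference.file  # -F: literal string, -w: whole word
done < source.file > outputfile

cat outputfile
```

The output follows the order of source.file rather than reference.file, and grep is still started once per ID.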

But since Lev's solution starts an external process only once, I prefer it.

If you want to solve it in pure bash, you could try this:

ID=($(<IDfile)) 

while read; do 
    for((i=0;i<${#ID[*]};++i)) { 
     [[ $REPLY =~ [[:space:]]${ID[$i]}$ ]] && echo $REPLY && break 
    } 
done <RefFile >outputfile 

cat outputfile

Output:

ENSG00000199537 SNORD115-40 
ENSG00000207793 MIR432 
ENSG00000207447 RNU6-2

Newer versions of bash support associative arrays. They can be used to simplify and speed up the key lookup:

declare -A ID 
for i in $(<IDfile); { ID[$i]=1;} 

while read v; do 
    [[ $v =~ [[:space:]]([^[:space:]]+)$ && ${ID[${BASH_REMATCH[1]}]} = 1 ]] && echo $v 
done <RefFile


2013-05-09 13:34:10 TrueY
