2017-09-14 23 views
0

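awk to add text to each pattern in a field

In the awk below, I am trying to add :p.= to each entry in $7, but only if the entry matches the pattern /NM/. The code seems to do that when there is only one NM in $7, as in line 2. However, when there are multiple NM entries in $7, as in line 3, the :p.= only gets added at the end. A ; is used to separate multiple NM entries within the field. I have added comments, but I am not sure what I am missing. Thank you :).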

Input (tab-delimited)

R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene 
1 chr1 948846 948846 - A dist=1 ISG15 
2 chr1 948870 948870 C G NM_005101:c.-84C>G ISG15 
3 chr1 948921 948921 T C NM_005101:c.-33T>C;NM_005101:c.-84C>G ISG15 
4 chr1 949654 949654 A G . ISG15 

AWK

awk ' 
    BEGIN { FS=OFS="\t" } # define FS and OFS as tab and start processing 
    $7 ~ /NM/ {   # look for pattern NM in $7 
     # split $7 by ";" and cycle through them 
      i=split($7,NM,";") 
      for (n=1; n<=i; n++) { 
       sub("$", ":p.=", $7) # add :p.= to the end of each $7 before the ;
    }  # close block 
}1' input # define input file 

Current output (tab-delimited)

R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene 
1 chr1 948846 948846 - A dist=1 ISG15 
2 chr1 948870 948870 C G NM_005101:c.-84C>G:p.= ISG15 
3 chr1 948921 948921 T C NM_005101:c.-33T>C;NM_005101:c.-84C>G:p.=p.= ISG15 
4 chr1 949654 949654 A G . ISG15 

Desired output (tab-delimited)

R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene 
1 chr1 948846 948846 - A dist=1 ISG15 
2 chr1 948870 948870 C G NM_005101:c.-84C>G:p.= ISG15 
3 chr1 948921 948921 T C NM_005101:c.-33T>C:p.=;NM_005101:c.-84C>G:p.= ISG15 
4 chr1 949654 949654 A G . ISG15 
+3

Who comes up with these horrible formats? It is neither machine-friendly nor human-friendly. – karakfa

+0

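Sorry, I tried to indent the code to make it more readable, but unfortunately this is the file format that comes from the instrument.... I think maybe viewing it in Excel would help. Thank you :). – Chris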

+2

':p.=' is drooling on a tie? :D –

Answer

2

Replace this:

     i=split($7,NM,";")
     for (n=1; n<=i; n++) {
      sub("$", ":p.=", $7) # add :p.= to the end of each $7 before the ;
     }

with this:

     out=""
     i=split($7,NM,/;/)
     for (n=1; n<=i; n++) {
      sub(/$/, ":p.=", NM[n]) # add :p.= to the end of each NM[n]
      out = (out=="" ? "" : out ";") NM[n] # rejoin the entries with ;
     }
     $7 = out
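
The original loop fails because sub() is applied to the whole of $7 on every iteration, so the suffix is appended to the end of the field once per entry rather than once to each entry. Operating on the split pieces and rebuilding $7 avoids that. For reference, here is a minimal end-to-end sketch of the same idea that recreates the sample input from the question and runs the script on it. The names input, result and parts are only illustrative, and the suffix is appended by plain concatenation instead of sub(), which is my own simplification rather than part of the snippet above.

```shell
# Recreate the tab-delimited sample from the question in a temporary file.
input=$(mktemp)
{
  printf 'R_Index\tChr\tStart\tEnd\tRef\tAlt\tDetail.refGene\tGene.refGene\n'
  printf '1\tchr1\t948846\t948846\t-\tA\tdist=1\tISG15\n'
  printf '2\tchr1\t948870\t948870\tC\tG\tNM_005101:c.-84C>G\tISG15\n'
  printf '3\tchr1\t948921\t948921\tT\tC\tNM_005101:c.-33T>C;NM_005101:c.-84C>G\tISG15\n'
  printf '4\tchr1\t949654\t949654\tA\tG\t.\tISG15\n'
} > "$input"

result=$(awk '
BEGIN { FS = OFS = "\t" }            # read and write tab-delimited fields
$7 ~ /NM/ {                          # only touch rows whose $7 contains NM
    n = split($7, parts, /;/)        # one array element per ;-separated entry
    out = ""
    for (i = 1; i <= n; i++) {
        # append the suffix to this entry, re-inserting ; between entries
        out = out (i > 1 ? ";" : "") parts[i] ":p.="
    }
    $7 = out                         # assigning $7 rebuilds the record with OFS
}
1                                    # print every record, modified or not
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Like the snippet above, this appends the suffix to every ;-separated entry of a matching $7. If a field could ever mix NM entries with other entries, the loop body would need its own parts[i] ~ /NM/ check.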