2014-03-13

awk: extract multiple columns with different delimiters and positions

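I have a question. I have a large data file with many columns and rows. The first group of columns is separated by tabs, and the second part is separated by ";". I want to extract the first five columns, plus the AF= and EUR_AF= fields from the ";"-separated part, and write them to a new file.

Example of the file (3 lines):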

13 19020013 rs181615907 C T 100 PASS AA=.;AC=83;AF=0.12;AFR_AF=0.05;AMR_AF=0.15;AN=758;ASN_AF=0.17;AVGPOST=0.8701;ERATE=0.0007;EUR_AF=0.11;LDAF=0.1423;RSQ=0.6009;SNPSOURCE=LOWCOV;THETA=0.0051;VT=SNP 
13 19020047 rs186129910 A . 100 PASS AA=.;AC=0;AF=0.0005;AFR_AF=0.0020;AN=758;AVGPOST=0.9992;ERATE=0.0005;LDAF=0.0008;RSQ=0.4992;SNPSOURCE=LOWCOV;THETA=0.0112;VT=SNP 
13 19020095 rs140871821 C T 100 PASS AA=.;AC=38;AF=0.05;AFR_AF=0.08;AMR_AF=0.05;AN=758;ASN_AF=0.03;AVGPOST=0.9904;ERATE=0.0005;EUR_AF=0.05;LDAF=0.0538;RSQ=0.9245;SNPSOURCE=LOWCOV;THETA=0.0069;VT=SNP 

I tried this:

awk -F'[\t;]' ' NR > 30 { 
    for (i = 1; i <= NF; i++) { 
     if ($i ~ /EUR_AF/) { 
     printf $1 " " $2 " " $3 " " $4 " " $5 " " $10 " " "%s ", $i 
     } 
    } 
    print "" 
}' head50.txt 

Output:

13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11 

13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05 
13 19020145 rs57048904 G T AF=0.61 EUR_AF=0.73 
13 19020341 rs184229798 C T AF=0.03 EUR_AF=0.09 
13 19020627 rs12018140 A G AF=0.70 EUR_AF=0.71 

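Problem: lines that have no EUR_AF field (such as the second one) are missing from the output. I would like to see those lines too, with only the AF value, as shown below: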

13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11 
13 19020047 rs186129910 A . AF=0.0005 
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05 
13 19020145 rs57048904 G T AF=0.61 EUR_AF=0.73 
13 19020341 rs184229798 C T AF=0.03 EUR_AF=0.09 
13 19020627 rs12018140 A G AF=0.70 EUR_AF=0.71 

I hope someone can help me.

Thanks in advance.

Answer

Here is a neat way to get what you want:

awk '{split($8,a,";AF=");split($8,b,";EUR_AF=");print $1,$2,$3,$4,$5,"AF="a[2]+0,"EUR_AF="b[2]+0}' file 
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11 
13 19020047 rs186129910 A . AF=0.0005 EUR_AF=0 
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05 

This prints EUR_AF=0 for line 2, because that line has no EUR_AF field.
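
The `+0` in the command above does two jobs. awk converts a string to a number using only its leading numeric prefix, so the rest of the field after the value is dropped, and an element that was never set counts as 0. A small sketch to illustrate this (the function name `coerce_demo` and the sample string are made up for the demo):

```shell
# Numeric coercion in awk: only the leading numeric prefix of a string is used.
coerce_demo() {
    awk 'BEGIN {
        s = "0.12;AFR_AF=0.05;AMR_AF=0.15"   # similar to a[2] after split() on line 1
        print s + 0                          # the prefix 0.12 is used, the rest is ignored
        print missing + 0                    # an unset variable counts as 0
    }'
}
coerce_demo
```

This should print 0.12 and then 0, which is why a line without EUR_AF ends up as EUR_AF=0.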

If you do not want it printed, you can test for it:

awk '{split($8,a,";AF=");split($8,b,";EUR_AF=");print $1,$2,$3,$4,$5,"AF="a[2]+0,(b[2]?"EUR_AF="b[2]+0:"")}' file 
13 19020013 rs181615907 C T AF=0.12 EUR_AF=0.11 
13 19020047 rs186129910 A . AF=0.0005 
13 19020095 rs140871821 C T AF=0.05 EUR_AF=0.05
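
A possible alternative, offered only as a sketch: instead of splitting on ";AF=" (which assumes AF is never the first entry in column 8), walk over the ";"-separated entries and pick the ones whose key matches exactly. The helper name `extract_af` is hypothetical, and the code assumes that column 8 is the ";"-separated part, as in the sample lines.

```shell
# Sketch: look up the AF and EUR_AF entries by key instead of splitting on ";AF=".
# extract_af is a hypothetical helper; it reads the file named in its first argument.
extract_af() {
    awk '{
        n = split($8, kv, ";")              # column 8 holds the ";"-separated entries
        af = ""; eur = ""
        for (i = 1; i <= n; i++) {
            if (kv[i] ~ /^AF=/)     af  = kv[i]   # anchored, so AFR_AF= and others do not match
            if (kv[i] ~ /^EUR_AF=/) eur = kv[i]
        }
        line = $1 " " $2 " " $3 " " $4 " " $5
        if (af  != "") line = line " " af         # append only the entries that exist
        if (eur != "") line = line " " eur
        print line
    }' "$1"
}
```

It could be called as `extract_af head50.txt > out.txt`, where out.txt is a placeholder name. Because the entries are copied as text, a value such as AF=0.70 stays as written, whereas the `+0` approach would print it as 0.7.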