2014-07-07 73 views
0

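Print search pattern in awk

I would like to print the lines that match a search pattern and then calculate the average over those lines. An example shows it best.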

Input file:

chr17 41275978 41276294 BRCA1_ex02_01 278 
chr17 41275978 41276294 BRCA1_ex02_01 279 
chr17 41275978 41276294 BRCA1_ex02_01 280 
chr17 41275978 41276294 BRCA1_ex02_02 281 
chr17 41275978 41276294 BRCA1_ex02_02 282 
chr17 41275978 41276294 BRCA1_ex02_03 283 
chr17 41275978 41276294 BRCA1_ex02_03 284 
chr17 41275978 41276294 BRCA1_ex02_03 285 
chr17 41275978 41276294 BRCA1_ex02_04 286 
chr17 41275978 41276294 BRCA1_ex02_04 287 
chr17 41275978 41276294 BRCA1_ex02_04 288 

I want to extract the lines that share the same 4th column, in a bash loop for example:

OUTPUT1:

chr17 41275978 41276294 BRCA1_ex02_01 278 
chr17 41275978 41276294 BRCA1_ex02_01 279 
chr17 41275978 41276294 BRCA1_ex02_01 280 

OUTPUT2:

chr17 41275978 41276294 BRCA1_ex02_02 281 
chr17 41275978 41276294 BRCA1_ex02_02 282 

OUTPUT3:

chr17 41275978 41276294 BRCA1_ex02_03 283 
chr17 41275978 41276294 BRCA1_ex02_03 284 
chr17 41275978 41276294 BRCA1_ex02_03 285 

and so on. Calculating the average of the 5th column is then easy:

awk '{sum += $5} END {if (NR > 0) print sum/NR}' in_file.txt

In my case there are thousands of BRCA1_exXX_XX lines, so any idea how to split them?

Paul。

Answers

1

Assuming the entries are sorted by column 4, as in the given data, you can do this:

awk '
    $4 != prev {                   # this line's 4th column differs from the previous line's
        if (cnt > 0)               # a previous group exists
            print prev, sum/cnt    # print that group's average
        prev = $4                  # remember the current 4th column
        sum = $5                   # start the sum with column 5
        cnt = 1                    # start the count at 1
        next                       # go to the next line
    }

    {
        sum += $5                  # accumulate the total of the 5th column
        ++cnt                      # increment the line count
    }

    END {
        if (cnt > 0)               # avoid dividing by zero on an empty file
            print prev, sum/cnt    # print the average for the last group
    }
' file
+0

This assumes the entries are always in sorted order. –

+0

Wow, it looks like it works :-) Thanks! Could you explain it? And can I add the standard deviation as a third column? – Geroge

+0

@EtanReisner Yes, it assumes the entries are sorted by column 4, as shown in the given data. – ooga
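On the standard deviation asked about in the comments: one possible way is to keep a running sum of squares next to the running sum. This is only a sketch, not part of the answer above. It assumes the input is sorted by column 4, it computes the population standard deviation (dividing by n, not n-1), and the names `sample.txt`, `group_stats` and `report` are made up for illustration.

```shell
# Sample data copied from the question, written to a hypothetical file name.
cat > sample.txt <<'EOF'
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
chr17 41275978 41276294 BRCA1_ex02_03 283
chr17 41275978 41276294 BRCA1_ex02_03 284
chr17 41275978 41276294 BRCA1_ex02_03 285
chr17 41275978 41276294 BRCA1_ex02_04 286
chr17 41275978 41276294 BRCA1_ex02_04 287
chr17 41275978 41276294 BRCA1_ex02_04 288
EOF

# group_stats FILE: print "key mean stddev" for each run of equal 4th columns.
# Assumes FILE is sorted (or at least grouped) by column 4.
group_stats() {
    awk '
        # Print the statistics of the group collected so far, if any.
        function report() {
            if (cnt > 0) {
                mean = sum / cnt
                var = sumsq / cnt - mean * mean   # population variance
                if (var < 0) var = 0              # guard against rounding error
                print prev, mean, sqrt(var)
            }
        }
        # A new key starts: flush the previous group and reset the accumulators.
        $4 != prev { report(); prev = $4; sum = 0; sumsq = 0; cnt = 0 }
        # Every line adds to the current group.
        { sum += $5; sumsq += $5 * $5; ++cnt }
        END { report() }
    ' "$1"
}

group_stats sample.txt
```

On the sample data the first line printed is `BRCA1_ex02_01 279 0.816497`. The sum-of-squares formula is simple but can lose precision when the values are large and close together; for such data a two-pass or Welford-style update may be a better fit.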

2

I think this will do what you want.

awk '{
    # Keep a running sum of the fifth column, keyed by the fourth column.
    v[$4] += $5
    # Count the lines that share each fourth-column value.
    n[$4]++
}

END {
    # Loop over every key seen and print it with the average of its fifth-column values.
    for (val in n) {
        print val ": " v[val]/n[val]
    }
}' "$file"
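One thing worth knowing about the array approach: it does not need sorted input, but awk does not define the order in which `for (key in array)` visits the keys, so the groups may print in any order. A minimal sketch that pipes the result through `sort` to get a stable order follows. The file name `shuffled.txt` and the function name `group_means` are invented for this example, and the data is a reordered subset of the question's sample.

```shell
# A small, deliberately unsorted subset of the question's data (hypothetical file name).
printf '%s\n' \
    'chr17 41275978 41276294 BRCA1_ex02_02 281' \
    'chr17 41275978 41276294 BRCA1_ex02_01 278' \
    'chr17 41275978 41276294 BRCA1_ex02_02 282' \
    'chr17 41275978 41276294 BRCA1_ex02_01 280' > shuffled.txt

# group_means FILE: print "key: mean" per distinct 4th column, sorted by key.
group_means() {
    awk '
        { sum[$4] += $5; n[$4]++ }                      # accumulate per key
        END { for (key in n) print key ": " sum[key] / n[key] }
    ' "$1" | sort                                       # for-in order is unspecified
}

group_means shuffled.txt
```

The trade-off is memory: the arrays hold one entry per distinct key, which should be fine for thousands of groups.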
+0

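Never use the letter 'l' as a variable name, because it looks too much like the digit '1'. In some fonts the two are completely indistinguishable. –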

+1

@EdMorton Fair enough. I used it to stand for "line", but that didn't make much sense here anyway. Edited. –

+0

Yes, this is great, it works very well. Thanks for the explanation! – Geroge
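If separate per-group files are still wanted, as in the OUTPUT1, OUTPUT2 listing in the question, awk can write each line to a file named after its 4th column. This is a rough sketch under a few assumptions: the output directory starts out empty, column 4 contains no `/` characters, and the names `sample_split.txt`, `split_by_group` and the `.txt` suffix are invented here.

```shell
# A short sample in the question's format (hypothetical file name).
printf '%s\n' \
    'chr17 41275978 41276294 BRCA1_ex02_01 278' \
    'chr17 41275978 41276294 BRCA1_ex02_01 279' \
    'chr17 41275978 41276294 BRCA1_ex02_02 281' > sample_split.txt

# split_by_group FILE DIR: append each line of FILE to DIR/<column 4>.txt.
split_by_group() {
    mkdir -p "$2"
    awk -v dir="$2" '
        # When the key changes, close the previous file so that thousands
        # of groups do not exhaust the open-file limit.
        $4 != prev {
            if (out != "") close(out)
            prev = $4
            out = dir "/" $4 ".txt"
        }
        # ">>" appends, so a key that reappears later is not overwritten.
        { print >> out }
    ' "$1"
}

outdir=$(mktemp -d)
split_by_group sample_split.txt "$outdir"
ls "$outdir"
```

Because the redirection appends, running it twice into the same directory duplicates the lines, which is why the sketch assumes a fresh directory.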