How do I count instances of a string in a tab-separated values (TSV) file?


I have a TSV file with hundreds of millions of lines, each of the form

foobar1 1 xxx yyy 
foobar1 2 xxx yyy 
foobar2 2 xxx yyy 
foobar2 3 xxx yyy 
foobar1 3 xxx zzz

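How can I count the instances of each unique integer in the entire second column of the file, and ideally append that count as a fifth value on each line?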

foobar1 1 xxx yyy 1 
foobar1 2 xxx yyy 2 
foobar2 2 xxx yyy 2 
foobar2 3 xxx yyy 2 
foobar1 3 xxx zzz 2

I would prefer a solution that uses only UNIX command-line stream-processing programs.

Source

2012-05-05 qazwsx

Please paste some sample data and your expected output. – Kent

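It is not entirely clear to me what you want to do. Do you want to add 0/1 as a fifth column depending on the value of the second column, or do you want the distribution of the values in the second column, totalled over the whole file?

In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file (with valueToCheck set beforehand, for example via -v valueToCheck=2).

In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.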
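For the second case, the same distribution can also be sketched with the classic cut/sort/uniq pipeline. The temporary file and the variable names `sample` and `distribution` below are assumptions made only for this illustration, and the sample rows are the ones from the question.

```shell
# Hypothetical sample file holding the five rows from the question.
sample=$(mktemp)
{
    printf 'foobar1\t1\txxx\tyyy\n'
    printf 'foobar1\t2\txxx\tyyy\n'
    printf 'foobar2\t2\txxx\tyyy\n'
    printf 'foobar2\t3\txxx\tyyy\n'
    printf 'foobar1\t3\txxx\tzzz\n'
} > "$sample"

# cut -f2 extracts the second tab-separated column, sort -n makes equal
# values adjacent, uniq -c counts each run, and awk reorders the result
# into "value: count" lines.
distribution=$(cut -f2 "$sample" | sort -n | uniq -c | awk '{ print $2 ": " $1 }')
printf '%s\n' "$distribution"

rm -f "$sample"
```

Unlike the awk array, this variant keeps nothing in memory beyond what sort needs, and its output comes out ordered by value.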

Source

2012-05-05 19:00:12

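I believe the second case is what is wanted, but it would take a second pass through the file to append the count to the end of each line. You could do it on the fly if you like, but the complexity goes up, and in essence it is still two passes. –

Does the suitability of the array h[$2] depend on the size of the largest integer? I have not checked, but some integer in the second column might be larger than the maximum machine number. – qazwsx

If so, you should at least get an error message. –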
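The two-pass idea discussed in these comments can be sketched in a single awk invocation by naming the file twice. This is a minimal sketch, assuming the input is a regular, non-empty file that can be read twice (not a pipe); the names `sample` and `annotated` are hypothetical.

```shell
# Hypothetical sample file holding the five rows from the question.
sample=$(mktemp)
{
    printf 'foobar1\t1\txxx\tyyy\n'
    printf 'foobar1\t2\txxx\tyyy\n'
    printf 'foobar2\t2\txxx\tyyy\n'
    printf 'foobar2\t3\txxx\tyyy\n'
    printf 'foobar1\t3\txxx\tzzz\n'
} > "$sample"

# Pass 1: NR == FNR holds only while the first file argument is being read,
#         so each value of column 2 is tallied and the line is skipped.
# Pass 2: every line is printed with its tally appended as a fifth field.
annotated=$(awk -F'\t' 'NR == FNR { h[$2]++; next } { print $0 "\t" h[$2] }' "$sample" "$sample")
printf '%s\n' "$annotated"

rm -f "$sample"
```

awk arrays are associative and their subscripts are strings, so the magnitude of the integers in the second column does not limit the index. Memory use grows with the number of distinct values rather than with the size of the largest one.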

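One solution using perl. It assumes that the values of the second column are sorted; that is, once a value of 2 is found, all lines with that same value are consecutive. The script keeps lines until it finds a different value in the second column, gets the count, prints them and frees the memory, so it should not be a problem no matter how large the input file is: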

Content of script.pl:

use warnings;
use strict;

## Lines of the current group and the value of their second column.
my (@group, $key);

## Print each saved line with the group size appended as last field,
## then free the memory.
sub flush_group {
    my $count = @group;
    printf qq[%s\t%d\n], $_, $count for @group;
    @group = ();
}

while (<>) {

    ## Remove last '\n'.
    chomp;

    ## Split line on whitespace.
    my @f = split;

    ## Assume a malformed line if it doesn't have four fields and omit it.
    next unless @f == 4;

    ## The second field has changed, so print the lines saved so far
    ## and begin saving lines with the new value.
    flush_group() if defined $key and $f[1] ne $key;

    ## Save the line until a different value is found in the second column.
    $key = $f[1];
    push @group, $_;
}

## Print the last group, which no later line has flushed. This also covers
## a final line whose value differs from the line before it.
flush_group();

Content of infile:

foobar1 1 xxx yyy 
foobar1 2 xxx yyy 
foobar2 2 xxx yyy 
foobar2 3 xxx yyy 
foobar1 3 xxx zzz

Run it like:

perl script.pl infile

With the following output:

foobar1 1 xxx yyy 1 
foobar1 2 xxx yyy 2 
foobar2 2 xxx yyy 2 
foobar2 3 xxx yyy 2 
foobar1 3 xxx zzz 2
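When the input is not already ordered by the second column, the same group-at-a-time idea could be sketched with sort feeding a streaming awk program. This is only an illustration under assumptions: the names `sample`, `grouped` and the awk function `emit` are hypothetical, the rows are deliberately shuffled, and LC_ALL=C is set so that ties are ordered byte by byte.

```shell
# Hypothetical sample file: the rows from the question in shuffled order.
sample=$(mktemp)
{
    printf 'foobar2\t3\txxx\tyyy\n'
    printf 'foobar1\t1\txxx\tyyy\n'
    printf 'foobar2\t2\txxx\tyyy\n'
    printf 'foobar1\t3\txxx\tzzz\n'
    printf 'foobar1\t2\txxx\tyyy\n'
} > "$sample"

# sort orders the rows numerically on column 2 so equal values are adjacent;
# awk then buffers one group at a time and prints it with its size appended.
grouped=$(LC_ALL=C sort -t "$(printf '\t')" -k2,2n "$sample" | awk -F'\t' '
    # Print the buffered group with its size as a fifth field, then reset.
    function emit(   i) {
        for (i = 1; i <= n; i++) print buf[i] "\t" n
        n = 0
    }
    NR > 1 && $2 != prev { emit() }    # column 2 changed: flush previous group
    { buf[++n] = $0; prev = $2 }       # buffer the current line
    END { emit() }                     # flush the final group
')
printf '%s\n' "$grouped"

rm -f "$sample"
```

The output is ordered by the second column rather than in the original line order, and the buffer holds one group at a time, so a single very frequent value would still need memory proportional to its group.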

Source

2012-05-08 08:41:40 Birei
