2014-06-18 38 views

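I have a file like the one below. I want to count the number of characters in each sequence, that is, the number of residues per sequence in the file.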

>1DMLA 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK 
>1DMLB 
DDVAARLRAAGFGAVGAGATAEETRRMLHRAFDTLA 
>2BHDC 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 

I tried the code below.

awk '/^>/ { res=substr($0, 2); } /^[^>]/ { print res " - " length($0); }' <file 

The output of the code above is

1DMLA - 80 
1DMLA - 80 
1DMLA - 80 
1DMLA - 79 
1DMLB - 36 
2BHDC - 80 
2BHDC - 80 
2BHDC - 80 

My expected output is

1DMLA - 319 
1DMLB - 36 
2BHDC - 240 

How can I change the code above so that it produces my expected output?


Best to avoid ' – Steve


Have you tested all of the solutions? – klashxx

Answers


Here is one way using awk:

awk '/^>/ && r { print r, "-", s; r=s="" } /^>/ { r = substr($0, 2); next } { s += length } END { print r, "-", s }' file 

Results:

1DMLA - 319 
1DMLB - 36 
2BHDC - 240 
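
For readers who want the logic of the one-liner spelled out, here is a minimal commented sketch of the same accumulate-and-flush idea. The function name count_residues and the made-up sample data are assumptions for illustration only, not part of the original answer.

```shell
# count_residues is a hypothetical helper name used only for this sketch.
count_residues() {
    awk '
        /^>/ {                                  # a header line starts a new record
            if (name != "") print name, "-", total   # flush the previous record
            name  = substr($0, 2)               # drop the leading ">"
            total = 0
            next
        }
        { total += length($0) }                 # sequence line: add its length
        END { if (name != "") print name, "-", total }   # flush the last record
    ' "$1"
}

# Usage with a small made-up FASTA file
sample=$(mktemp)
printf '>seqA\nMTDSP\nGGV\n>seqB\nDDVA\n' > "$sample"
count_residues "$sample"
rm -f "$sample"
```

The key change from the code in the question is that the length is summed per header and printed only when the next header (or the end of the file) is reached, instead of being printed for every line.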

Like this:

awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile 

Or, formatted:

awk -F\> '/^>/ {
    if (seqlen != "")
        print seqlen
    printf("%s - ", $2)
    seqlen = 0
    next
}
seqlen != "" { seqlen += length($0) }
END { print seqlen }' infile

See: Sequence length of FASTA file

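Besides producing the expected result, this also handles unexpected file formats such as the one below, where the file starts with sequence lines that have no header and one header has no sequence at all.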

$ cat infile 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK 
>1DMLB 
>2BHDC 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 


$ awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile 
1DMLB - 0 
2BHDC - 240 
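
The role of the seqlen != "" guard may be easier to see on a tiny input. The sketch below wraps the same awk program in a function and runs it on made-up data; the name fasta_lengths_guarded and the sample file contents are assumptions for illustration.

```shell
# fasta_lengths_guarded is a hypothetical wrapper name; the awk program is the one from the answer.
fasta_lengths_guarded() {
    awk -F'>' '
        /^>/ {
            if (seqlen != "") print seqlen   # finish the previous record, if there was one
            printf("%s - ", $2)              # start the new record with its name
            seqlen = 0                       # from here on seqlen is set, so lines are counted
            next
        }
        seqlen != "" { seqlen += length($0) }   # skipped until the first header has been seen
        END { print seqlen }
    ' "$1"
}

# Made-up input: one headerless line, one empty record, one normal record
sample=$(mktemp)
printf 'AAAA\n>B\n>C\nMTDSP\n' > "$sample"
fasta_lengths_guarded "$sample"
rm -f "$sample"
```

On this input the headerless line AAAA is ignored because seqlen is still unset, the empty record is reported as B - 0, and the last record as C - 5.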
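
Another way, using > as the record separator so that each record holds one header followed by its sequence lines (this assumes header lines contain no spaces):

awk -v RS='>' 'NF { gsub(/\r/, ""); n = 0
       for (i = 2; i <= NF; i++) n += length($i)
       printf "%s - %d\n", $1, n }' input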

Could you elaborate on what changes you made and what those changes do, for future reference? –