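Incrementing a counter as a dictionary value in a loop

I have a list called aa_seq containing a few hundred amino acid sequences. It looks like this: ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK', ...]. Each sequence is 27 letters long. I need to determine the most common amino acid at each position (1-27) and how often it occurs.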

So far I have:

count_dict = {}
counter = count_dict.values()
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
for p in range(0,26):  # first round: looks at the first position in each sequence
    for s in range(0,len(aa_seq)):  # goes through all sequences of the list
        for item in aa_list:  # and checks for the occurrence of each amino acid letter (=item)
            if item in aa_seq[s][p]:
                count_dict[item]  # if that letter occurs at the respective position, make it a key in the dictionary
                counter += 1  # and increase its counter (the value, as defined above) by one
print count_dict

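It raises KeyError: 'A' and points to the line count_dict[item]. So apparently the items of aa_list can't be added as keys this way? How do I do that instead? It also gives an error about the counter: "'int' object is not iterable". How can I increment the counter?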

2017-04-09 ccaarroo

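What are you trying to do with 'count_dict[item]'? Even if 'item' existed in the dictionary, that line would only look up the value and immediately discard it; you are not assigning anything there.

Also, 'counter' is defined as the list of values of count_dict at the start; it is an empty list, because count_dict is empty. So 'counter += 1' makes no sense, because you can't add an integer to a list.

Unlike languages such as C++, where you can initialize a dictionary (map) entry simply by referencing it, in Python you need to initialize dictionary entries explicitly. – Unlocked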
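
Both errors described in the comments can be reproduced in isolation. The following is a minimal sketch in Python 3; wrapping the values in `list(...)` is an assumption made here to mimic Python 2, where `dict.values()` returns a plain list (the `print` statement in the question suggests Python 2).

```python
count_dict = {}

# Looking up a key that was never assigned raises KeyError.
try:
    count_dict['A']
except KeyError as err:
    key_error = err

# In Python 2, dict.values() returns a list; list(...) mimics that here.
counter = list(count_dict.values())

# `some_list += x` tries to extend the list by iterating over x,
# so an int on the right-hand side raises TypeError.
try:
    counter += 1
except TypeError as err:
    type_error = err

print(repr(key_error))
print(type_error)
```

Neither line ever stores a count: the lookup has no assignment, and `counter` is a separate object that is not tied to any dictionary key.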

To add an item to the dictionary, you have to initialize it with a value:

if item not in count_dict: 
    count_dict[item]=0

You can use the setdefault method to do this as a one-liner:

count_dict.setdefault(item,0)
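
Note that `setdefault` only makes sure the key exists; the increment still has to happen separately. A minimal sketch of how the two steps might be combined, using a made-up list `letters` purely for illustration:

```python
count_dict = {}
letters = ['A', 'K', 'A', 'T']  # hypothetical sample data, not from the question

for item in letters:
    # setdefault returns the existing value, or stores 0 and returns it
    count_dict[item] = count_dict.setdefault(item, 0) + 1

print(count_dict)
```

Here 'A' ends up with a count of 2 and the other letters with 1.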

2017-04-09 21:13:38 WNG

Here is a quick way to count items in a dictionary; just add it to whatever code you have written:

count_dict = {} 

aa_list = ['A', 'C', 'D', 'E' ,'F' ,'G' ,'H' ,'I' ,'K' ,'L' , 
     'M' ,'N' ,'P' ,'Q' ,'R' ,'S' ,'T' ,'V' ,'W' ,'Y'] 

for element in aa_list: 
    count_dict[element]=(count_dict).get(element,0)+1 

print (count_dict)
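
As written, the loop above counts the letters of `aa_list` itself, so every value ends up as 1. Applied to the question, the same `get` pattern could be run over one position of the sequences. A sketch, assuming `aa_seq` holds the four sequences shown in the question and `position` is a hypothetical zero-based index chosen for illustration:

```python
aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
          'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
position = 6  # zero-based, i.e. the 7th residue of each sequence

count_dict = {}
for seq in aa_seq:
    letter = seq[position]
    # get() returns 0 for letters that have not been seen yet
    count_dict[letter] = count_dict.get(letter, 0) + 1

print(count_dict)
```

For this position the result contains 'P' twice and 'G' and 'L' once each.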

2017-04-09 21:18:00 citizen2077

You can use the Counter class:

>>> from collections import Counter 

>>> l = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK'] 
>>> s = [Counter([l[j][i] for j in range(len(l))]).most_common()[0] for i in range(27)] 
>>> s 
[('A', 1), 
('A', 1), 
('Y', 1), 
('I', 1), 
('N', 1), 
('Y', 1), 
('P', 2), 
('M', 4), 
('S', 2), 
('Q', 1), 
('E', 2), 
('Q', 1), 
('I', 1), 
('I', 1), 
('A', 1), 
('Q', 1), 
('A', 1), 
('I', 1), 
('I', 1), 
('Q', 1), 
('E', 2), 
('C', 1), 
('Q', 1), 
('A', 1), 
('Q', 1), 
('I', 1), 
('I', 1)]

But this may be quite inefficient if you have a large dataset.
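
One way to avoid the repeated indexing is to transpose the sequences with `zip`, so that each column is handed to `Counter` directly. This is a sketch of an alternative, not the code from the answer above; it assumes all sequences have the same length, since `zip` silently stops at the shortest one. The name `seqs` is chosen here for illustration.

```python
from collections import Counter

seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
        'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

# zip(*seqs) yields one tuple of letters per position, e.g. ('A', 'K', 'R', 'T')
most_common_per_position = [
    Counter(column).most_common(1)[0] for column in zip(*seqs)
]

print(most_common_per_position[7])
```

Index 7 is the position where all four example sequences share an 'M'.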

2017-04-09 21:22:09 greole

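Ah, that's cool, I can try that. But what does 'most_common()[0]' do, since the output just gives the counts of all the letters..? – ccaarroo

@ccaarroo: The list is the information you asked for. The first tuple is the most common character at index 0 of the sequences, and it occurs once. For example, you can see that 'M' at index 7 occurs 4 times.

'most_common([n])' returns the n most common elements, ordered from most to least frequent; without n it returns all of them. So 'most_common()[0]' picks out the single most common element at position i. – greole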
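
To make the difference concrete, here is a small sketch on a single column of letters. The string 'FSSV' is the 9th position of the four example sequences, typed out by hand. When several letters share the top count, which one comes first is a tie-breaking detail that depends on the Python version, so no particular winner should be relied on in that case.

```python
from collections import Counter

column = Counter('FSSV')

print(column.most_common())     # every letter with its count, most frequent first
print(column.most_common(1))    # a list containing only the top entry
print(column.most_common()[0])  # the top entry itself, as a (letter, count) tuple
```

The last two calls differ only in shape: `most_common(1)` still returns a list, while indexing with `[0]` unwraps the tuple.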

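Modified code

Here is a modified, working version of your code. It isn't efficient, but it should output the correct results.

A few notes:

You need one counter per index, so you should initialize your dictionary inside the first loop.
range(0,26) has only 26 elements: from 0 to 25 (inclusive).
defaultdict lets you use 0 as the starting value for every key.
You need to increment the counter with count_dict[item] += 1.
At the end of each pass of the outer loop, you need to find the key (amino acid) with the highest value (number of occurrences).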

from collections import defaultdict

aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
          'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',  # one-letter code for amino acids
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

for p in range(27):  # first round: looks at the first position in each sequence
    count_dict = defaultdict(int)  # initialize counter with 0 as default value
    for s in range(0, len(aa_seq)):  # goes through all sequences of the list
        # and checks for the occurrence of each amino acid letter (=item)
        for item in aa_list:
            if item in aa_seq[s][p]:
                # if that letter occurs at the respective position,
                # increment its count (the key is created on first use)
                count_dict[item] += 1
    print(max(count_dict.items(), key=lambda x: x[1]))

It outputs:

('R', 1) 
('S', 1) 
('Y', 1) 
('S', 1) 
('E', 1) 
('P', 1) 
('P', 2) 
('M', 4) 
...

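Alternative with Counter

You don't need that many loops; you only need to iterate once over each character of each sequence.

Also, there is no need to reinvent the wheel: Counter and most_common are better alternatives to defaultdict and max.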

from collections import Counter 

aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK'] 

counters = [Counter() for i in range(27)] 

for aa_seq in aa_seqs: 
    for i, aa in enumerate(aa_seq):
        counters[i][aa] += 1

most_commons = [counter.most_common()[0] for counter in counters] 
print(most_commons)

It outputs:

[('K', 1), ('A', 1), ('Y', 1), ('N', 1), ('N', 1), ('Y', 1), ('P', 2), ('M', 4), ('S', 2), ('Q', 1), ('E', 2), ('G', 1), ('H', 1), ('N', 1), ('L', 1), ('N', 1), ('N', 1), ('I', 1), ('G', 1), ('H', 1), ('E', 2), ('G', 1), ('N', 1), ('K', 1), ('Y', 1), ('K', 1), ('G', 1)]
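
If "frequency" in the question means a fraction rather than a raw count, the counts can be divided by the number of sequences. The following is a sketch under that assumption; `summarize` is a hypothetical helper name and not part of the answer above.

```python
from collections import Counter

def summarize(seqs):
    """Return one (letter, count, fraction) tuple per position."""
    counters = [Counter() for _ in range(len(seqs[0]))]
    for seq in seqs:
        for i, aa in enumerate(seq):
            counters[i][aa] += 1
    total = len(seqs)
    result = []
    for counter in counters:
        letter, count = counter.most_common(1)[0]
        result.append((letter, count, count / total))
    return result

aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
           'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']

summary = summarize(aa_seqs)
print(summary[7])
```

With only four sequences most positions are ties, so the reported letter at those positions is arbitrary; with a few hundred sequences the fractions become more informative.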

2017-04-09 21:31:12
