在特定索引處遞增列表的值python

我正在嘗試爲序列中的每個核苷酸（A，G，C，T）創建一個列表，其中列表的索引對應於序列中的位置，值在所有序列是核苷酸的頻率，這裏有4個序列爲例：在特定索引處遞增列表的值python

>ignore this 
GTAGGGCGA 
>ignore this 
GTATACAGC 
>ignore this 
GTTTCTCTT 
>ignore this 
GTAATCAAA

我寫的代碼：

def function(filename, length): 
    g,t,c,a = [],[],[],[] 
    with open(filename, "r") as f: 
     for line in f: 
      if line.startswith('GT'): 
       gcount, acount, tcount, ccount = 0, 0, 0, 0 
       g = [gcount + 1 if nuc == 'G' else gcount for nuc in line[:length]] 
       return g

眼下，這個代碼只是着眼於G核苷酸我得到了每個序列的列表，而不是1個列表，它們列出了每個索引處的值噸。

[1, 0, 0, 1, 1, 1, 0, 1, 0] 
[1, 0, 0, 0, 0, 0, 0, 1, 0] 
[1, 0, 0, 0, 0, 0, 0, 0, 0] 
[1, 0, 0, 0, 0, 0, 0, 0, 0]

我想什麼是我的輸出爲單獨G：

[4, 0, 0, 1, 1, 1, 0, 2, 0]

來源

2016-04-24 Ouwan12

您可以使用numpy這一點。只需將您的列表轉換爲numpy陣列並添加即可。

import numpy as np 

list1 = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0]) 
list2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0]) 
list3 = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0]) 
list4 = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0]) 

list1 + list2 + list3 +list4 # desired result! 
>>> array([4, 0, 0, 1, 1, 1, 0, 2, 0])

這裏是你如何修改當前的功能支持這一點：

def function(filename, length) 
    g,t,c,a = [],[],[],[] 
    # create an array of expected length of g filled with 0s 
    base = np.zeros((1,length)) # 1 row, `length` number of columns 
    with open(filename, "r") as f: 
     for line in f: 
      if line.startswith('GT'): 
       gcount, acount, tcount, ccount = 0, 0, 0, 0 
       g = np.array([gcount + 1 if nuc == 'G' else gcount for nuc in line[:length]]) 
       base = base + g # add this new numpy array 
    return base # return the summed result

這裏是numpy安裝說明。

來源

2016-04-24 22:51:21

我一直對numpy是什麼以及它如何適用於我的代碼感興趣。謝謝！然而，這個例子只顯示了4個序列，文件中可能有數千個序列。我可以修改這個代碼來解釋這個嗎？ – Ouwan12

@Ouwan當然。實際上，我已經通過修改'function'函數爲您提供了這樣的實現。 –

當我從這些例子中測試的文件切換到不同長度的序列時，我得到ValueError：操作數無法與形狀（1,9）（5，）一起廣播？ – Ouwan12

你可以這樣做：

g = [sum(_) for _ in zip(*[[1, 0, 0, 1, 1, 1, 0, 1, 0], [1, 0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0]])]

來源

2016-04-24 23:05:49 user312016

這可能是更容易一些理解。請注意，如果您的輸入文件過大，最終會失敗：

import collections 

Test_data = """>ignore this 
GTAGGGCGA 
>ignore this 
GTATACAGC 
>ignore this 
GTTTCTCTT 
>ignore this 
GTAATCAAA 
""" 


import io 
testfh = io.StringIO(Test_data) 

counts = [collections.Counter(fil) for fil in zip(*(x.strip() for x in testfh if not x.startswith('>')))] 

for key in 'ACGT': 
    key_counts = [cnt[key] for cnt in counts] 
    print("{}: {}".format(key, key_counts))

輸出看起來是這樣的：

A: [0, 0, 3, 1, 1, 0, 2, 1, 2] 
C: [0, 0, 0, 0, 1, 2, 2, 0, 1] 
G: [4, 0, 0, 1, 1, 1, 0, 2, 0] 
T: [0, 4, 1, 2, 1, 1, 0, 1, 1]

編輯

沒有內涵：

counts = [collections.Counter(fil) for fil in zip(*(x.strip() for x in testfh if not x.startswith('>')))]

變成這樣：

clean_lines = [] 
for x in testfh: 
    if not x.startswith('>'): 
     clean_lines.append(x.strip())

在這一點上，clean_lines只包含好的部分，沒有換行：

GTAGGGCGA 
GTATACAGC 
GTTTCTCTT 
GTAATCAAA

接下來，我把他們側身，爲了養活垂直條紋Counter：

file_and_rank = zip(*clean_lines)

在該行中，*clean_lines取每行（它是一個字符串）並將它們展平成一個大參數列表，就好像我已經調用過：

file_and_rank = zip('GTAGGGCGA', 'GTATACAGC', 'GTTTCTCTT', 'GTAATCAAA', ...)

zip操作組合了可迭代項。它同時循環所有的數據，從每個迭代中取一個數值。然後它將所有的值放到一個元組中，並返回它。因此，它轉換GTAGGGCGA，...串到元組與每個字符串的第一，第二，第三等字符：

(GGGG) 
    (TTTT) 
    (AATA) 
    (GTTA) 
    ...

接下來，我需要建立的多核苷酸是如何在每個位置對應的計。所以我會用collections.Counter（它需要一個可迭代的！），但是我必須爲每個位置分別設置一個。因此，一個人名單：

counts = [] 
for fil in file_and_rank: 
    counts.append(collections.Counter(fil))

我只是把一個元組，像（AATA），並把它傳遞給構造函數collections.Counter。這將產生一個與該位置{A:3, T:1}計數器。

來源

2016-04-24 23:20:26

這是非常好的。什麼是「fil」？另外，我是一名學生，希望向我展示如何在沒有列表理解的情況下使用計數線？謝謝！ – Ouwan12

我使用了'fil'，因爲'file'被保留，可能會令人困惑。基本上，它是一個垂直切片輸入（[見「等級」和「文件」]（https://en.wikipedia.org/wiki/Rank_and_file））。沒有理解，這將是巨大的。 –

和奧斯汀一樣，我推薦使用來自collections模塊的Counter。它是爲這種任務而構建的。以下是我的變化：爲每個核苷酸保留一個計數器，並向計數器提供每個核苷酸出現的位置。它像原始代碼一樣處理一行。

from collections import Counter 

def function(filename, length): 
    # a counter for each nucleotide 
    count = {'G':Counter(), 
      'T':Counter(), 
      'C':Counter(), 
      'A':Counter() 
      } 

    max_length = 0 

    with open(filename, "r") as f: 
     for line in f: 
      if line.startswith('GT'): 
       for position, nuc in enumerate(line): 
        # update position count for the nucleotides in this line 
        count[nuc].update([position]) 

       # keep track of longest line 
       max_length = max(max_length, position) 

    g = [[count[nuc][i] for i in range(max_length)] for nuc in 'GTCA'] 

    return g

來源

2016-04-25 01:40:46 RootTwo

嗯我收到以下錯誤：文件「monkey.py」，第31行，在函數（ 'introntest.txt'，9）文件「monkey.py」，第25行，在功能計數[ nuc] .update（position）更新 _count_elements（self，iterable）文件「C：\ Users \ Leddy \ AppData \ Local \ Programs \ Python \ Python35-32 \ lib \ collections \ __init__.py」 TypeError：'int'對象不可迭代 – Ouwan12

我的不好。該行應讀取'count [nuc] .update（[position]）'。修復了答案。 – RootTwo

在特定索引處遞增列表的值python

回答

相關問題