如何在python中填充矩陣

我試圖編寫一個需要輸出爲矩陣的代碼，但由於是新手，我沒有正確理解它。基本上我想爲每列的A，C，G，T生成一個計數矩陣。我能夠做到這一點，但很難爲其他專欄做。如何在python中填充矩陣

輸入文件

>Rosalind_1 
ATCCAGCT 
>Rosalind_2 
GGGCAACT 
>Rosalind_3 
ATGGATCT 
>Rosalind_4 
AAGCAACC 
>Rosalind_5 
TTGGAACT 
>Rosalind_6 
ATGCCATT 
>Rosalind_7 
ATGGCACT

到目前爲止我的代碼

fh_in = open("consensus_seq.txt", 'r') 

A_count = 0 
C_count = 0 
G_count = 0 
T_count = 0 

result = [] 
for line in fh_in: 
    line = line.strip() 
    if not line.startswith(">"): 
     for nuc in line[0]: 
      if nuc == "A": 
       A_count += 1 
      if nuc == "C": 
       C_count += 1 
      if nuc == "G": 
       G_count += 1 
      if nuc == "T": 
       T_count += 1 
result.append(A_count) 
result.append(C_count) 
result.append(G_count) 
result.append(T_count) 
print result

輸出

[5, 0, 1, 1]

我想要的實際產量

A 5 1 0 0 5 5 0 0 
C 0 0 1 4 2 0 6 1 
G 1 1 6 3 0 1 0 0 
T 1 5 0 0 0 1 1 6

任何幫助/提示表示讚賞。

來源

2014-11-06 upendra

你可以使用numpy加載文本文件。由於格式是有點古怪，很難加載，但在這之後的總和變得微不足道：

import numpy as np 
data = np.loadtxt("raw.txt", comments=">", 
        converters = {0: lambda s: [x for x in s]}, dtype=str) 

print (data=="A").sum(axis=0) 
print (data=="T").sum(axis=0) 
print (data=="C").sum(axis=0) 
print (data=="G").sum(axis=0)

輸出：

[5 1 0 0 5 5 0 0] 
[1 5 0 0 0 1 1 6] 
[0 0 1 4 2 0 6 1] 
[1 1 6 3 0 1 0 0]

的真正的好處，這是你所構建的numpy的陣列可以做其他事情。例如，假設我想知道的是我們在「Rosalinds」列中發現的A的平均次數，而不是總和：

print (data=="A").mean(axis=0) 
[ 0.71428571 0.14285714 0. 0. 0.71428571 0.71428571 0. 0.]

來源

2014-11-06 19:31:54 Hooked

這是如此簡單。謝謝.... – upendra 2014-11-07 20:20:27

import collections 
answer = [] 
with open('blah') as infile: 
    rows = [line.strip() for _,line in zip(infile, infile)] 
    cols = zip(*rows) 
    for col in cols: 
    d = collections.Counter(col) 
    answer.append([d[i] for i in "ATCG"]) 
    answer = [list(i) for i in zip(*answer)] 
for line in answer: 
    print(' '.join([str(i) for i in line]))

輸出：

來源

2014-11-06 02:47:53 inspectorG4dget

首先使行的列表，剝出開始的行>。然後你可以zip這把它變成列表的列表。然後你可以列出每個字母的列數。

rows = [line.strip() for line in infile if not line.startswith('>')] 
columns = zip(*rows) 
for letter in 'ACGT': 
    print letter, [column.count(letter) for column in columns]

但是，如果您的文件非常大，這可能會佔用大量內存。另一種方法是逐行排列字母。

counts = {letter: [0] * 8 for letter in 'ACGT'} 
for line in infile: 
    if not line.startswith('>'): 
     for i, letter in enumerate(line.strip()): 
      counts[letter][i] += 1 
for letter, columns in counts.items(): 
    print letter, columns

你也可以使用一個Counter，特別是如果你不知道提前多少列有如何將是：

from collections import Counter 
# ... 
counts = Counter() 
for line in infile: 
    if not line.startswith('>'): 
     counts.update(enumerate(line.strip())) 

columns = range(max(counts.keys())[0]) 
for letter in 'ACGT': 
    print letter, [counts[column, letter] for column in columns]

來源

2014-11-06 02:53:13 Stuart

+1。我知道拉鍊，但告訴我關於這個'zip（* rows）'，這裏代表什麼？ – Hackaholic 2014-11-06 03:08:31

*解壓縮參數列表https://docs.python.org/2/tutorial/controlflow.html#unpacking-argument-lists。在這種情況下，它會將第一行作爲第一個參數，第二行作爲第二個參數，等等。 – Stuart 2014-11-06 03:11:07

如果列的數目不是8，該怎麼辦？有沒有辦法在計數列表中插入'len'函數？ – upendra 2014-11-06 03:22:37

如何在python中填充矩陣

回答

相關問題