頻率不能累加到一個

我正在寫一個函數，該函數應該通過DNA序列的.fasta文件，併爲文件中的每個序列創建一個核苷酸（nt）和二核苷酸（dnt）頻率字典。然後，我將每個字典存儲在名爲「頻率」的列表中。這是一段代碼，行事詭異：頻率不能累加到一個

for fasta in seq_file: 
    freq = {} 
    dna = str(fasta.seq) 
    for base1 in ['A', 'T', 'G', 'C']: 
     onefreq = float(dna.count(base1))/len(dna) 
     freq[base1] = onefreq 
     for base2 in ['A', 'T', 'G', 'C']: 
      dinucleotide = base1 + base2 
      twofreq = float(dna.count(dinucleotide))/(len(dna) - 1) 
      freq[dinucleotide] = twofreq 
    frequency.append(freq)

（我用biopython，順便說一下，這樣我就不必提交整個FASTA文件到內存同樣這是一個單鏈DNA，所以我不需要考慮反義dnt）

單個NT記錄的頻率增加到1.0，但dnt的頻率不會增加到1.0。這是因爲計算兩種頻率的方法在我眼中是相同的。

我離開了診斷打印報表，「檢查」變量：

for fasta in seq_file: 
    freq = {} 
    dna = str(fasta.seq) 
    check = 0.0 
    check2 = 0.0 
    for base1 in ['A', 'T', 'G', 'C']: 
     onefreq = float(dna.count(base1))/len(dna) 
     freq[base1] = onefreq 
     check2 += onefreq 
     for base2 in ['A', 'T', 'G', 'C']: 
      dinucleotide = base1 + base2 
      twofreq = float(dna.count(dinucleotide))/(len(dna) - 1) 
      check += twofreq 
      print(twofreq) 
      freq[dinucleotide] = twofreq 
    print("\n") 
    print(check, check2) 
    print(len(dna)) 
    print("\n") 
    frequency.append(freq)

得到這個輸出：（只適用於一個序列）

0.0894168466523 
0.0760259179266 
0.0946004319654 
0.0561555075594 
0.0431965442765 
0.0423326133909 
0.0747300215983 
0.0488120950324 
0.0976241900648 
0.0483801295896 
0.0539956803456 
0.0423326133909 
0.0863930885529 
0.0419006479482 
0.0190064794816 
0.031101511879 


(0.9460043196544274, 1.0) 
2316

在這裏我們可以看到頻率可能的16個不同dnt中的每一個，所有dnt頻率的總和（0.946）以及所有nt頻率的總和（1.0）和序列的長度。

爲什麼dnt頻率不加1.0？

感謝您的幫助。我對python很陌生，這是我的第一個問題，所以我希望這個提交是可以接受的。

來源

2015-05-27 Bantha

確定你正在使用Python 2？看起來你正在使用'print（）'函數，而不是'print'語句。 – TigerhawkT3

我正在使用python 2.我只是習慣於從一次冒險中寫出print（）函數，這是我曾經在Uni使用過的一種冒險。由於它從來沒有給我帶來麻煩，所以我從來沒有改變過打印語句。 – Bantha

在這種情況下，我建議使用'from __future__ import print_function'，這樣（例如）你的'print（check，check2）'打印'check'，然後'check2'分隔一個空格而不是打印元組' （check，check2）'。 – TigerhawkT3

str.count（）沒有計算它找到的重疊主題。

例：

如果你在你的序列「AAAA」，你去找二核苷酸「AA」，你還指望比「AAAA'.count（」 AA'）返回你3個，但它會返回2。所以：代替1

print float('AAAA'.count('AA'))/(len('AAAA') - 1) 
0.666666

你可以改變，你通過計數頻率行：

twofreq = len([i for i in range(len(dna)-1) if dna[i:i+2] == dinucleotide])/float((len(dna) - 1))

來源

2015-05-27 17:06:58

謝謝！建議修復工程很棒。 – Bantha

歡迎！如果您的問題得到解決，請不要忘記接受答案。 –

您的問題，請嘗試使用以下FASTA：

 
>test 
AAAAAA

"AAAAAA".count("AA")

你：

應該

原因

從文檔：計數返回字符串s的（非重疊）的出現子子的數目[開始：結束]

溶液使用Counter和塊功能

from Bio import SeqIO 

def chunks(l, n): 
    for i in xrange(0, len(l)-(n-1)): 
    yield l[i:i+n] 

from collections import Counter 

frequency = [] 
input_file = "test.fasta" 
for fasta in SeqIO.parse(open(input_file), "fasta"): 
    dna = str(fasta.seq) 
    freq = Counter(dna) #get counter of single bases 
    freq.update(Counter(chunks(dna,2))) #update with counter of dinucleotide 
    frequency.append(freq)

爲「 AAAAAA「你得到：

Counter({'A': 6, 'AA': 5})

來源

2015-05-27 17:02:24

好的，謝謝！在這種情況下，什麼可以幫助我解決重疊問題？ – Bantha

@Bantha爲此更好使用收藏櫃檯 –

你要掃描的字符串遠遠超過你需要 - 20事實上，時間。這對於小的測試序列可能無關緊要，但是隨着它們變大，它會變得明顯。我會建議不同的方法，從而解決了這一問題展開重疊作爲一個副作用：

nucleotides = [ 'A', 'T', 'G', 'C' ] 
dinucleotides = [ x+y for x in nucleotides for y in nucleotides ] 
counts = { x : 0 for x in nucleotides + dinucleotides } 

# count the first nucleotide, which has no previous one 
n_nucl = 1 
prevn = dna[0] 
counts[prevn] += 1 

# count the rest, along with the pairs made with each previous one 
for nucl in dna[1:]: 
    counts[nucl] += 1 
    counts[prevn + nucl] += 1 
    n_nucl += 1 
    prevn = nucl 

total = 0.0 
for nucl in nucleotides: 
    pct = counts[nucl]/float(n_nucl) 
    total += pct 
    print "{} : {} {}%".format(nucl, counts[nucl], pct) 
print "Total : {}%".format(total) 

total = 0.0 
for dnucl in dinucleotides: 
    pct = counts[dnucl]/float(n_nucl - 1) 
    total += pct 
    print "{} : {} {}%".format(dnucl, counts[dnucl], pct) 
print "Total : {}%".format(total)

此方法僅通過串掃描一次，雖然它固然更多的代碼...

來源

2015-05-27 17:49:59 twalberg

頻率不能累加到一個

回答

相關問題