用蟒/ biopython計數DNA序列

我的腳本下面從一個標準FASTA文件計數序列「CCCCAAAA」和「GGGGTTTT」的出現：用蟒/ biopython計數DNA序列

>contig00001 
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT

腳本計數這裏CCCCAAAA序列3倍

CCCCAAAACCCCAAAACCCCAAAA（CCCC不計算在內）

有人可以請告知我怎麼會在末尾作爲半計數CCCC序列返回的3.5這個值。

我到目前爲止一直沒有成功。

我的腳本如下...

from Bio import SeqIO 

input_file = open('telomer.test.fasta', 'r') 
output_file = open('telomer.test1.out.tsv','w') 
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n') 

for cur_record in SeqIO.parse(input_file, "fasta") : 


    contig = cur_record.name 
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA') 
    CCCC_count = cur_record.seq.count('CCCC') 

    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT') 
    GGGG_count = cur_record.seq.count('GGGG') 
    #length = len(cur_record.seq) 

    splittedContig1=contig.split(CCCCAAAA_count) 

    splittedContig2=contig.split(GGGGTTTT_count) 

    cnt1=len(splittedContig1)-1 
    cnt2=len(splittedContig2) 

    cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count 
    cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count 

    output_line = '%s\t%i\t%i\n' % \ 
    (CONTIG, CCCCAAAA_count, GGGGTTTT_count) 


    output_file.write(output_line) 

output_file.close() 

input_file.close()

來源

2014-04-01 sheaph

您可以使用拆分和startwith列表理解如下：

contig="CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT" 
splitbase="CCCCAAAA" 
halfBase="CCCC" 
splittedContig=contig.split(splitbase) 
cnt=len(splittedContig)-1 
print cnt+sum([0.5 for e in splittedContig if e.startswith(halfBase)])

輸出：

3.5

分裂字符串基於CCCCAAAA。它將使名單中，CCCCAAAA將被刪除的分裂
長列表中的元素 - 1給出了CCCCAAAA
的發生次數，尋找元素與CCCC開始。如果找到，則爲每個事件添加0.5來計數。

來源

2014-04-01 16:01:07 user3

這本身工作得很好，但是當我嘗試集成到我的腳本中時出現錯誤：SyntaxError：無法分配給函數調用甚至有可能這樣做？ – sheaph

你是如何整合你的腳本的？ – user3

我輸入了我的編輯版本 – sheaph

用蟒/ biopython計數DNA序列

回答

相關問題