2013-10-22 99 views
4

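Translating DNA to protein

I am a biology graduate student, and over the past few months I have taught myself a small amount of Python to process some of my data. I am not asking for homework help; this is a research project.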

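With this code, I intend to take the part of a string called sequence that lies between the start site of protein translation, i.e. the first occurrence of ATG (the biological term is the start codon), and the first occurrence of TAA after it (a stop codon).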

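The function translate_dna() should then swap every three letters of that string for the matching dictionary value. The variable cds comes out correctly, but the for loop or the if statement in my function is not working properly :( Any suggestions? The input file format is as follows: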

>gnl|GNOMON|230560476.m Model predicted by Gnomon on Homo sapiens unplaced genomic scaffold, alternate assembly HuRef DEGEN_1103279082069, whole genome shotgun sequence (NW_001841731.1) 
CCCCAGTAGCTGGGATTACAGGTTATCCAAGGACATGGAAAAGCCAACACCATGGTAGCATTAATGAAAG 
TTTACCAAGAGGAAGATGAAGCCTACCAGGAATTAGTTACCATGGCAACCATGTTTTTCCAGTACTTACT 
GCAGCCATTTAGGGCTATGCGAGAAGTTGCAACTTTATGTAAGCTTGAT 

>gnl|GNOMON|230560472.m Model predicted by Gnomon on Homo sapiens unplaced genomic scaffold, alternate assembly HuRef DEGEN_1103279082069, whole genome shotgun sequence (NW_001841731.1) 
GCCGGCGTTTGACCGCGCTTGGGTGGCCTGGGACCCTGTGGGAGGCTTCCCCGGCGCCGAGAGCCCTGGC 
TGACGGCTGATGGGGAGGAGCCGGCGGGCGGAGAAGGCCACGGGCTCCCCAGTACCCTCACCTGCGCGGG 
ATCGCTGCGGGAAACCAGGGGGAGCTTCGGCAGGGCCTGCAGAGAGGACAAGCGAAGTTAAGAGCCTAGT 
GTACTTGCCGCTGGGAGCTGGGCTAGGCCCCCAACCTTTGCCCTGAAGATGCTGGCAGAGCAGGATGTTG 
TAACGGGAAATGTCAGAAATACTGCAAGCAAACTGAAAACAACCCATCCATGTAGGAAAGAATAACACGG 
ACTACACACTATGAGGAAACCACAGGGGAGTTTCAGGCCAGTCAGCTTTTGATCTTCAACTTTATAACTT 
TCACCTTAGGATATGACGAGCCCACCGGAGTTTCAAAAATGGTATCATTTTGTATCAGGCTTGTTTTTTA 
CACTCTTGGTTTCTCACAGAGATAGGTGGTTTCTCCTTAAAATCGAACATTTATATGATGCATTTTACTG 
TAGTTACTATCAGAAAAGTTAGTTTTCCCAAATTTAAGTTCACTCTGGGGTACTATAGCGTGAATGTAGT 
TCATTCTGTTGAGCTAGTTGTTCATGTTAGTGTAGTTCACATATTTATCTGGAACTCAAAAATGAGGGGT 
TGAGAGGGGAAGCTAAAATTCAAAACATGTCCAAATATATAATTTTAATATTTTACTTTATATTTAAAAT 
AGAAAAGCAATTGATTCTAGAATTAGACTAATTGCTAGCATTGCTAGGATATATAAAATGAAGCTGAATG 
TTTTAACTCTGGAATTTTTCTGAATAGTCTAAGAAATAAGGCTGAAGTGTATCACTTGCCTTAAGTTTAC 
TTTTGCGTGTGTGTTTTAATTTTGTTCAGTGGGGCTTTCACTTAAAAAAAAAACCATAATATTATTACCT 
GGATAAAAAATACAGCTGAAAGTAGATCACTTTATCTTTAAGCAGAAGGATGGAAATAGAAGAATTTTAA 
GAATGTATTGGTTGAAAAACATCTATATTATTTTATTTTTATTTCTCTTCTTGTGGGAGTAAAATAATTT 
CCAACCAAATCAGTCCACCTAGATTATACACTGTTCAGTTTGTTTTCTGCCCTGCAGCACAAGCAATAAC 
CAGCAGAGACTGGAACCACAGCTGAGGCTCTGTAAATGAGTTGACTGCTAAGGACTTCATGGGGATATTA 
ACCTGGGGCATTAAGAGAATCAACATGCTAAAGTACTTGGAGACAGCTCTGTAATGTTTTATGAGGTTTT 
TTGTTTTTTTTTTTTGAGACAGAGTCTTGCACTGTCGCCCAGGCTGG 

Code:

from sys import argv 
script, filename = argv 

def translate_dna(sequence): 

    codontable = { 
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R', 
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 
    proteinsequence = '' 
    start = sequence.find('ATG') 
    sequencestart = sequence[int(start):] 
    stop = sequencestart.find('TAA') 
    cds = str(sequencestart[:int(stop)+3]) 

    for n in range(0,len(cds),3): 
     if cds[n:n+3] in codontable == True: 
      proteinsequence += codontable[cds[n:n+3]] 
      print proteinsequence 
     sequence = '' 


header = '' 
sequence = '' 
for line in open(filename): 
    if line[0] == ">": 
     if header != '': 
      print header 
      translate_dna(sequence) 

     header = line.strip() 
     sequence = '' 
    else: 
     sequence += line.strip() 

print header 
translate_dna(sequence) 
+0

If this isn't homework, why not use e.g. [Biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc25)? – zero323

+0

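Someone pointed this out to me, and that is actually where I started. I don't believe Biopython restricts the reading frame to the first start codon and the first stop codon. That is fine if you are unsure about the RNA data you have, but I know I have mRNA, which has a single well-defined start site and stop site, so when I got results with multiple start and stop amino acids, some of them were wrong. –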

+0

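If this assumption about Biopython is incorrect, then I must have misread the cookbook, and it would help if someone pointed out where I misread it. I'm not opposed to any solution. –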

Answers

5

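Your problem stems from the line

if cds[n:n+3] in codontable == True:

This always evaluates to False, so you never append to proteinsequence. Python chains comparison operators, so the condition is read as (cds[n:n+3] in codontable) and (codontable == True), and a dictionary is never equal to True. Just remove the == True part, like so

if cds[n:n+3] in codontable:

and you will get the protein sequence. Also, be sure to return proteinsequence in translate_dna().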
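The chained-comparison behaviour described above can be checked in isolation. This is a minimal sketch in Python 3; the two-entry dictionary and the variable names `chained`, `expanded` and `plain` are illustrative stand-ins, not part of the original script.

```python
# Tiny stand-in for the full codon table; two entries are enough here.
codontable = {'ATG': 'M', 'TAA': '_'}
codon = 'ATG'

# Written as in the question: `in` and `==` are both comparison
# operators, so Python chains them instead of evaluating left to right.
chained = codon in codontable == True

# What the chained form actually means. The second half compares the
# whole dictionary with True, which is always False.
expanded = (codon in codontable) and (codontable == True)

# The plain membership test is what the original code intended.
plain = codon in codontable

print(chained, expanded, plain)  # False False True
```

Writing `(codon in codontable) == True` with explicit parentheses would also work, but the bare membership test is the usual way to express it.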

+0

Thank you so much! Removing == True and adding the return gave me what I needed. I would upvote if I could! –

+1

@KarlEricSwanson Even if you can't upvote, you can accept the answer. Tick the checkmark under the vote count – goncalopp

2

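There is one more problem with your code: when you use stop = sequencestart.find('TAA'), you do not take the open reading frame into account. In the code below I split the sequence into triplets and use itertools.takewhile to handle this, but it can be done with a for loop as well: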

from itertools import takewhile 

def translate_dna(sequence, codontable, stop_codons=('TAA', 'TGA', 'TAG')):
    start = sequence.find('ATG')

    # find() returns -1 when there is no start codon; nothing to translate then
    if start == -1:
        return ''

    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]

    # Split it into triplets
    codons = [trimmed_sequence[i:i+3] for i in range(0, len(trimmed_sequence), 3)]

    # Take all codons until first stop codon (a trailing partial codon also ends the run)
    coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3, codons)

    # Translate and join into string
    protein_sequence = ''.join([codontable[codon] for codon in coding_sequence])

    # This line assumes there is always stop codon in the sequence
    return "{0}_".format(protein_sequence)
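The for-loop alternative mentioned above could look roughly like the following. This is only a sketch in Python 3: the function name `translate_in_frame`, the three-entry `example_table` and the `example_sequence` string are made up for illustration, and the input is assumed to contain only uppercase A, C, G and T. Unlike the takewhile version, it does not append a trailing `_`.

```python
def translate_in_frame(sequence, codontable, stop_codons=('TAA', 'TGA', 'TAG')):
    """Translate from the first ATG, reading codon by codon in frame."""
    start = sequence.find('ATG')
    if start == -1:
        return ''  # no start codon at all

    protein = []
    # Step by 3 from the start codon; the upper bound skips a trailing
    # partial codon of one or two bases.
    for i in range(start, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if codon in stop_codons:
            break  # first stop codon in the same frame as ATG
        protein.append(codontable[codon])
    return ''.join(protein)


# Hypothetical example: just the codons this sequence needs.
example_table = {'ATG': 'M', 'GTA': 'V', 'ACC': 'T'}
example_sequence = 'CCATGGTAACCTAAGG'

start = example_sequence.find('ATG')
naive_stop = example_sequence.find('TAA', start)

# The first TAA found by a plain string search is not a multiple of
# three bases away from ATG, so it is not a stop codon in this frame.
print((naive_stop - start) % 3)  # 1
print(translate_in_frame(example_sequence, example_table))  # MVT
```

Here the string search hits the TAA that straddles the codons GTA and ACC, while the in-frame loop reads on to the real stop codon.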
+0

Thank you as well. This is perfect. I'm no authority, but from what I can see it seems more Pythonic than my original code. –

+0

Oh, after looking at the results, this is close, but I think codons = [trimmed_sequence[i:i+3] for i in range(len(trimmed_sequence)/3)] should look like codons = [trimmed_sequence[i:i+3] for i in range(0, len(trimmed_sequence), 3)] –

+0

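Otherwise it reads three characters at a time but only shifts by +1 from the original start. That is, ATGG yields MW, from ATG (characters 0-2) and TGG (characters 1-3), and so on for longer sequences. –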
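The difference between the two list comprehensions in the comments above can be seen on a six-base string. A small sketch in Python 3, where integer division is written `//` (the comment's `/` is the Python 2 spelling); the names `overlapping` and `in_frame` are illustrative only.

```python
trimmed_sequence = 'ATGGCC'

# Index steps by 1, so consecutive windows overlap by two bases.
overlapping = [trimmed_sequence[i:i + 3]
               for i in range(len(trimmed_sequence) // 3)]

# Index steps by 3, so each window is a separate codon.
in_frame = [trimmed_sequence[i:i + 3]
            for i in range(0, len(trimmed_sequence), 3)]

print(overlapping)  # ['ATG', 'TGG']
print(in_frame)     # ['ATG', 'GCC']
```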
