2016-02-25 56 views
3

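Bioinformatics: finding the genes in a given genome string

Biologists use sequences of the letters A, C, T, and G to model a genome. A gene is a substring of a genome that starts after the triplet ATG and ends before one of the triplets TAG, TAA, or TGA. In addition, the length of a gene string is a multiple of 3, and the gene does not contain any of the triplets ATG, TAG, TAA, or TGA.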

Desired output:

Enter a genome string: TTATGTTTTAAGGATGGGGCGTTAGTT #Enter 
TTT 
GGGCGT 
----------------- 
Enter a genome string: TGTGTGTATAT 
No Genes Were Found 

So far, I have:

def findGene(gene): 
    final = "" 
    genep = gene.split("ATG") 
    for part in genep: 
     for chr in part: 
      for i in range(0, len(chr)): 
       if genePool(chr[i:i + 3]) == 1: 
        break 
       else: 
        final += (chr[i+i + 3] + "\n") 
    return final 

def genePool(part): 
    g1 = "ATG" 
    g2 = "TAG" 
    g3 = "TAA" 
    g4 = "TGA" 
    if (part.count(g1) != 0) or (part.count(g2) != 0) or (part.count(g3) != 0) or (part.count(g4) != 0): 
     return 1 

def main(): 
    geneinput = input("Enter a genome string: ") 
    print(findGene(geneinput)) 

main() 
# TTATGTTTTAAGGATGGGGCGTTAGTT 

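I keep running into errors.

To be completely honest, this really isn't working for me - I think I've hit a dead end with these lines of code - a new approach might help.

Thanks in advance!

The error I keep getting -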

Enter a genome string: TTATGTTTTAAGGATGGGGCGTTAGTT 
Traceback (most recent call last): 
    File "D:\Python\Chapter 8\Bioinformatics.py", line 40, in <module> 
    main() 
    File "D:\Python\Chapter 8\Bioinformatics.py", line 38, in main 
    print(findGene(geneinput)) 
    File "D:\Python\Chapter 8\Bioinformatics.py", line 25, in findGene 
    final += (chr[i+i + 3] + "\n") 
IndexError: string index out of range 

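As I said before, I really don't know whether my current code is on the right track for solving this problem - any new ideas or pseudocode would be appreciated!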

+0

Which errors are you getting? – mhawke

+0

Do you plan to use this on large datasets, or only on short snippets? – Moritz

+0

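@mhawke The errors I run into seem to revolve around `[i:i + 3]` - for example, when something goes wrong with the slice (the `[i:i + 3]` part), it runs past the end of the string. Does that help? –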

Answers

3

This can be done with a regular expression:

import re 

# ATG, then the fewest possible base triplets (captured), then a stop triplet
pattern = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)') 
print(pattern.findall('TTATGTTTTAAGGATGGGGCGTTAGTT')) 
print(pattern.findall('TGTGTGTATAT')) 

Output

 
['TTT', 'GGGCGT'] 
[] 
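To get the exact output format asked for in the question (one gene per line, or a message when nothing matches), the `findall` result could be wrapped roughly like this. This is only a sketch: the names `GENE_PATTERN` and `format_genes` are made up for illustration, and the pattern is the same one used above.

```python
import re

# Same pattern as in the answer; the constant name is an assumption.
GENE_PATTERN = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')


def format_genes(genome):
    """Return the matched genes one per line, or a fallback message."""
    genes = GENE_PATTERN.findall(genome)
    if not genes:
        return "No Genes Were Found"
    return "\n".join(genes)


print(format_genes('TTATGTTTTAAGGATGGGGCGTTAGTT'))
print(format_genes('TGTGTGTATAT'))
```

In the asker's `main()`, the string passed to `format_genes` would come from `input("Enter a genome string: ")` instead of being hard-coded.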

Explanation of the pattern, from https://regex101.com/r/yI4tN9/3

"ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)"g 
    ATG matches the characters ATG literally (case sensitive) 
    1st Capturing group ((?:[ACTG]{3})+?) 
     (?:[ACTG]{3})+? Non-capturing group 
      Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy] 
      [ACTG]{3} match a single character present in the list below 
       Quantifier: {3} Exactly 3 times 
       ACTG a single character in the list ACTG literally (case sensitive) 
    (?:TAG|TAA|TGA) Non-capturing group 
     1st Alternative: TAG 
      TAG matches the characters TAG literally (case sensitive) 
     2nd Alternative: TAA 
      TAA matches the characters TAA literally (case sensitive) 
     3rd Alternative: TGA 
      TGA matches the characters TGA literally (case sensitive) 
    g modifier: global. All matches (don't return on first match) 
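
The lazy quantifier `+?` is what makes the match stop at the first stop triplet it reaches. A small comparison against a greedy `+` shows the difference; the sample sequence here is made up for illustration and does not come from the question.

```python
import re

# Lazy version (as in the answer) versus a greedy variant for comparison.
lazy = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')
greedy = re.compile(r'ATG((?:[ACTG]{3})+)(?:TAG|TAA|TGA)')

sample = 'ATGAAATAGCCCTAA'  # hypothetical sequence with two stop triplets

lazy_result = lazy.findall(sample)
greedy_result = greedy.findall(sample)

print(lazy_result)    # ['AAA']        stops at the first stop triplet, TAG
print(greedy_result)  # ['AAATAGCCC']  runs on to the last one, TAA
```

The greedy variant swallows the in-frame TAG, so its result would break the rule that a gene contains no stop triplet.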
+0

I'm not familiar with `re`. Could you give a very brief explanation? - Thanks @mhawke –

+1

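@MattRumbel: Basically, the regex pattern looks for ATG, followed by the shortest run of base triplets, until it sees one of the sentinel triplets TAG, TAA, or TGA. As Stidgeon commented, this will also match genes that contain ATG inside the sequence. (I don't know whether ATG should be allowed there, I'm not a biologist) – mhawke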

+0

Thanks for the explanation, that was easy to understand –
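
On the original `IndexError`: `for chr in part` iterates over single characters, so `chr[i+i + 3]` asks for index 3 of a one-character string. And as noted in the comments, the regular expression can return a gene that itself contains ATG. If that should be rejected, one possible alternative is to walk the string triplet by triplet without `re`. The sketch below assumes that "contains" means in-frame triplets only (counted from the end of the start ATG); the problem statement does not settle this, and the function name `find_genes` is made up.

```python
STOP_TRIPLETS = {"TAG", "TAA", "TGA"}


def find_genes(genome):
    """Collect genes between ATG and the first in-frame stop triplet.

    Assumption: a candidate is dropped if an in-frame ATG appears
    before the stop triplet; the scan then resumes just past the
    current start so the inner ATG can act as a new start.
    """
    genes = []
    i = genome.find("ATG")
    while i != -1:
        start = i + 3
        j = start
        gene = None
        while j + 3 <= len(genome):
            triplet = genome[j:j + 3]
            if triplet in STOP_TRIPLETS:
                if j > start:  # skip empty genes such as ATGTAG
                    gene = genome[start:j]
                break
            if triplet == "ATG":  # gene may not contain ATG
                break
            j += 3
        if gene is not None:
            genes.append(gene)
            i = genome.find("ATG", j + 3)  # continue after the stop triplet
        else:
            i = genome.find("ATG", i + 1)
    return genes


print(find_genes('TTATGTTTTAAGGATGGGGCGTTAGTT'))  # ['TTT', 'GGGCGT']
print(find_genes('ATGCCCATGAAATAG'))              # ['AAA']
```

For the second, made-up input the regular expression above would return `['CCCATGAAA']`, while this scan restarts at the inner ATG and returns `['AAA']`.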