2015-06-06 27 views
2

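Split string by container in Python

So I'm trying to write a program that can be used to analyze DNA, and right now I'm trying to split out the genes in a "strand". To do that, I need to analyze the strand and split it at any of the three stop codons (groups of three base pairs). My code currently looks like this: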

class Strand:

    def __init__(self, code):
        self.code = [code]
        self.endCodons = []
        self.genes = []

    def getGenes(self):
        for codon in self.endCodons:
            for code in self.code:
                code = code.split(codon)


strand = Strand("ATCATGCACATAGAAACTGATACACACCACAGTGATCACATGAAGTACACATG")
strand.getGenes()
print(strand.genes)

However, when I run it, it returns an empty list. I could use some advice.

+4

What do you expect your program to do? 'self.endCodons' is an empty list, so 'getGenes()' won't do anything. – MattDMo

+0

What are *STOP codons*? You need to put them in your question! – Kasramvd

+0

https://www.google.com/#q=stop+codons – MattDMo

Answers

1

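Looping over each stop codon and splitting on it will produce incorrect output, because I assume the stop codons can appear in any order in the sequence, whereas iterating over the list of stop codons would require them to occur in that same order.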

So, if I understand correctly, you will need to scan your string from left to right and search for the codons as you go:

class Strand:
    def __init__(self, code):
        self.code = code
        self.endCodons = ["TAG", "TAA", "TGA"]
        self.genes = []

    def getGenes(self):
        if (len(self.code) % 3 != 0):
            print("Input sequence is not divisible by 3?")

        # In this, we assume each stop codon is always 3 characters.
        iteration = 0
        lastGeneEnd = 0
        while (iteration < len(self.code)):
            # What is our current 3 character sequence? (Unless it's at the end)
            currentSequence = self.code[iteration:iteration + 3]

            # Check if our current 3 character sequence is an end codon
            if (currentSequence in self.endCodons):
                # What will our gene length be?
                geneLength = (iteration + 3) - lastGeneEnd

                # Make sure we only break into multiples of 3
                overlap = 3 - (geneLength % 3)
                # There is no overlap if our length is already a multiple of 3
                if (overlap == 3): overlap = 0

                # Modify the gene length to reflect our overlap into a multiple of 3
                geneLength = geneLength + overlap

                # Update the iteration so we don't process any more than we need
                iteration = iteration + overlap + 3

                # Grab the entire gene sequence, including the stop codon
                gene = self.code[lastGeneEnd:iteration]

                # If we have a 3-length gene and there's nothing left, just append to the last gene retrieved as it has
                # got to be part of the last sequence
                if (len(gene) == 3 and iteration >= len(self.code)):
                    lastIndex = len(self.genes) - 1
                    self.genes[lastIndex] = self.genes[lastIndex] + gene
                    break

                # Make sure we update the last end index so we don't include portions of previous positives
                lastGeneEnd = iteration

                # Append the result to our genes and continue
                self.genes.append(gene)

                continue

            iteration = iteration + 1

strand = Strand("ATCATGCACATAGAAACTGATACACACCACAGTGATCACATGAAGTACACATG")
strand.getGenes()
print("Got Genes: ")
print(strand.genes)

for gene in strand.genes:
    print("Sequence '%s' is a multiple of 3: %u" % (gene, len(gene) % 3 == 0))

I'm not really a biologist, so I may have made some incorrect assumptions.

Edit:

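The code now guarantees that the breaks fall on multiples of three, but I still don't think I fully understand the required logic. It works on the given example, but I'm not sure it behaves correctly in other cases.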
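The multiple-of-three constraint can also be met by stepping through the sequence one whole codon at a time, so that a stop codon is only recognized when it starts on a codon boundary. This is a minimal sketch, not a definitive implementation. It assumes the reading frame starts at the first base and that each gene includes its stop codon. The names `STOP_CODONS` and `split_genes_in_frame` are hypothetical and do not come from the code above.

```python
# Hypothetical helper: the names and the "frame starts at index 0"
# assumption are illustrative, not taken from the original code.
STOP_CODONS = {"TAG", "TAA", "TGA"}


def split_genes_in_frame(sequence, stop_codons=STOP_CODONS):
    """Split `sequence` after each in-frame stop codon.

    Returns (genes, tail): `genes` is a list of pieces that each end
    with a stop codon, and `tail` is whatever follows the last stop
    codon (possibly an empty string).
    """
    genes = []
    start = 0
    # Visit only complete codons at offsets 0, 3, 6, ...
    for i in range(0, len(sequence) - 2, 3):
        if sequence[i:i + 3] in stop_codons:
            # Cut right after the stop codon, so every gene length
            # is a multiple of 3.
            genes.append(sequence[start:i + 3])
            start = i + 3
    tail = sequence[start:]
    return genes, tail


print(split_genes_in_frame("AGTAGATAA"))
# → (['AGTAGATAA'], '')
```

Because the out-of-frame "TAG" in "AGTAGATAA" straddles two codons, it is skipped and the sequence stays in one piece. Returning the leftover tail separately leaves it to the caller to decide what an unterminated or incomplete ending should mean.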

+0

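Your code works well, except that it doesn't split the genes into groups of 3. For example, "AGTAGATAA" should come out as one gene, but it comes out as "AGTAG ATAA". It should only split genes at multiples of 3. – mcchucklezz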

+0

Ah, okay, I'll correct it (I misunderstood how it was supposed to be split). – Ragora

+0

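It now splits into multiples of three, but I don't know whether that is exactly what you need. It also assumes the input length is always divisible by 3 (hence the if check for a dangling 3-character sequence); is that a correct assumption? It also assumes you won't encounter a stop codon as the first sequence in the string. (Sequences flow from left to right, so that shouldn't happen?) I think a better example now would be one that properly splits a longer sequence, like the one in the original post. – Ragora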