2012-10-01 76 views
3

我想在Python中使用Smith–Waterman algorithm實現本地序列對齊。如何決定在Smith-Waterman算法中回溯的方式?

這是我到目前爲止。它得到儘可能建設similarity matrix

import sys, string 
from numpy import * 

f1=open(sys.argv[1], 'r') 
seq1=f1.readline() 
f1.close() 
seq1=string.strip(seq1) 

f2=open(sys.argv[2], 'r') 
seq2=f2.readline() 
f2.close() 
seq2=string.strip(seq2) 

a,b =len(seq1),len(seq2) 

penalty=-1; 
point=2; 

#generation of matrix for local alignment 
p=zeros((a+1,b+1)) 

# table calculation and matrix generation 
for i in range(1,a+1): 
    for j in range(1,b+1): 
     vertical_score =p[i-1][j]+penalty; 
     horizontal_score= p[i][j-1]+penalty; 
     if seq1[i-1]==seq2[j-1]: 
      diagonal_score =p[i-1][j-1]+point; 
     else: 
      diagonal_score = p[i-1][j-1]+penalty; 
     p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score); 

print p 

例如,兩個序列:

agcacact 
acacacta 

我的代碼輸出相似矩陣:

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0.] 
[ 0. 2. 1. 2. 1. 2. 1. 0. 2.] 
[ 0. 1. 1. 1. 1. 1. 1. 0. 1.] 
[ 0. 0. 3. 2. 3. 2. 3. 2. 1.] 
[ 0. 2. 2. 5. 4. 5. 4. 3. 4.] 
[ 0. 1. 4. 4. 7. 6. 7. 6. 5.] 
[ 0. 2. 3. 6. 6. 9. 8. 7. 8.] 
[ 0. 1. 4. 5. 8. 8. 11. 10. 9.] 
[ 0. 0. 3. 4. 7. 7. 10. 13. 12.]] 

現在我被困在算法的下一步,回溯構建最佳對齊。

Wikipedia says

爲了得到最佳局部比對,我們先從在基質中的最高值(Ĵ)。然後,我們倒退到位置之一(我- 1,Ĵ),(Ĵ - 1)和( - 1,Ĵ - 1)視用於構建矩陣的移動方向。我們保持這個過程直到我們到達一個零值的矩陣單元,或者位置(0,0)的值。

我無法確定回溯到其位置。維基百科所說的「取決於用於構建矩陣的移動方向」是什麼意思,以及我將如何在Python中實現這一點?

最後我做這個

import sys, string 
from numpy import* 
import re 

# read first sequence 

fasta_sequence1=open(sys.argv[1], 'r') 

seq1="" 
for i in fasta_sequence1: 
    if i.startswith(">"): 
     pass 
    else: 
     seq1 = seq1 + i.strip() 
fasta_sequence1.close() 

fasta_sequence2=open(sys.argv[2], 'r') 

seq2 = "" 
for i in fasta_sequence2: 
    if i.startswith('>'): 
     pass 
    else: 
     seq2 = seq2+ i.strip() 
fasta_sequence2.close() 

a,b =len(seq1),len(seq2) 


penalty=-1; 
point=2; 

#generation of matrix for local alignment 

p=zeros((a+1,b+1)) 

#intialization of max score 
max_score=0; 
#pointer to store the traceback path 

pointer=zeros((a+1,b+1)) 

# table calculation and matrix generation 
for i in range(1,a+1): 
    for j in range(1,b+1): 
     vertical_score =p[i-1][j]+penalty; 
     horizontal_score= p[i][j-1]+penalty; 
     if seq1[i-1]==seq2[j-1]: 
      diagonal_score =p[i-1][j-1]+point; 
     else: 
      diagonal_score = p[i-1][j-1]+penalty; 

for i in range(1,a+1): 
    for j in range(1,b+1):    
     p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score); 

for i in range(1,a+1): 
    for j in range(1,b+1): 
     if p[i][j]==0: 
      pointer[i][j]=0; #0 means end of the path 
     if p[i][j]==vertical_score: 
      pointer[i][j]=1; #1 means trace up 
     if p[i][j]==horizontal_score: 
      pointer[i][j]=2; #2 means trace left 
     if p[i][j]==diagonal_score: 
      pointer[i][j]=3; #3 means trace diagonal 
     if p[i][j]>=max_score: 
      maximum_i=i; 
      maximum_j=j; 
      max_score=p[i][j]; 


#for i in range(1,a+1): 
    # for j in range(1,b+1): 
     #if p[i][j]>max_score: 
      #max_score=p[i][j] 

print max_score 

# traceback of all the pointers 

align1,align2='',''; #initial sequences 

i,j=max_i,max_j; #indices of path starting point 

while pointer[i][j]!=0: 
    if pointer[i][j]==3: 
    align1=align1+seq1[i-1]; 
    align2=align2+seq2[j-1]; 
    i=i-1; 
    j=j-1; 
    elif pointer[i][j]==2: 
    align1=align1+'-'; 
    align2=align2+seq2[j-1] 
    j=j-1; 
    elif pointer[i][j]==1: 
    align1=align1+seq1[i-1]; 
    align2=align2+'-'; 
    i=i-1; 



align1=align1[::-1]; #reverse sequence 1 
align2=align2[::-1]; #reverse sequence 2 

#output_file = open(sys.argv[3],"w") 
#output_file.write(align1) 

#output_file.write(align2) 

print align1 
print align2 

但我認爲有可能是這樣

結果的更好和更有效的方式顯示像

agcacacta-cacact 

上反義詞:

print align1 
print align2 

顯示正確的輸出:

agcacact 
a-cacact 

我怎樣才能在文件寫入上面的輸出。謝謝

回答

11

當您構建相似度矩陣時,您不僅需要存儲相似度分數,而且還需要存儲其中該分數來自。您目前有一行代碼:

p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score); 

所以在這裏你需要記住的不只是得分最高,但這些的是最大的。然後,當你來回溯時,你會知道哪個方向。

例如,你可以嘗試這樣的:

import numpy 

DELETION, INSERTION, MATCH = range(3) 

def smith_waterman(seq1, seq2, insertion_penalty = -1, deletion_penalty = -1, 
        mismatch_penalty = -1, match_score = 2): 
    """ 
    Find the optimum local sequence alignment for the sequences `seq1` 
    and `seq2` using the Smith-Waterman algorithm. Optional keyword 
    arguments give the gap-scoring scheme: 

    `insertion_penalty` penalty for an insertion (default: -1) 
    `deletion_penalty` penalty for a deletion (default: -1) 
    `mismatch_penalty` penalty for a mismatch (default: -1) 
    `match_score`  score for a match (default: 2) 

    See <http://en.wikipedia.org/wiki/Smith-Waterman_algorithm>. 

    >>> for s in smith_waterman('AGCAGACT', 'ACACACTA'): print s 
    ... 
    AGCAGACT- 
    A-CACACTA 
    """ 
    m, n = len(seq1), len(seq2) 

    # Construct the similarity matrix in p[i][j], and remember how 
    # we constructed it -- insertion, deletion or (mis)match -- in 
    # q[i][j]. 
    p = numpy.zeros((m + 1, n + 1)) 
    q = numpy.zeros((m + 1, n + 1)) 
    for i in range(1, m + 1): 
     for j in range(1, n + 1): 
      deletion = (p[i - 1][j] + deletion_penalty, DELETION) 
      insertion = (p[i][j - 1] + insertion_penalty, INSERTION) 
      if seq1[i - 1] == seq2[j - 1]: 
       match = (p[i - 1][j - 1] + match_score, MATCH) 
      else: 
       match = (p[i - 1][j - 1] + mismatch_penalty, MATCH) 
      p[i][j], q[i][j] = max((0, 0), deletion, insertion, match) 

    # Yield the aligned sequences one character at a time in reverse 
    # order. 
    def backtrack(): 
     i, j = m, n 
     while i > 0 or j > 0: 
      assert i >= 0 and j >= 0 
      if q[i][j] == MATCH: 
       i -= 1 
       j -= 1 
       yield seq1[i], seq2[j] 
      elif q[i][j] == INSERTION: 
       j -= 1 
       yield '-', seq2[j] 
      elif q[i][j] == DELETION: 
       i -= 1 
       yield seq1[i], '-' 
      else: 
       assert(False) 

    return [''.join(reversed(s)) for s in zip(*backtrack())] 
+0

恰尼如何在打印文件中的輸出與琴絃一個在另一個之上。 – Nodnin

+1

在第一個字符串之後和第二個字符串之前向該文件寫入一個換行符。 –

+0

由於它不匹配,2個對齊的字符串可能在同一位置包含不同的字母(當'MATCH'源自構造p時的不匹配) - 正確嗎?另外,雖然p [i] [j]!= 0''也可能是'while q [i] [j]!= 0' –