Protein sequence pattern matching in Python

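I am working on a matching algorithm for protein sequences. I start with an aligned protein sequence, and I am trying to convert a misaligned sequence into the correctly aligned one. Here is an example: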

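Original aligned sequence: ----AB--C-D-----

Unaligned sequence: --A--BC---D-

The expected output should look like this:

Original aligned sequence: ----AB--C-D-----

Unaligned sequence: ----AB--C-D----- (both are now identical)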

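I was told to be very specific about my problem, but the sequences I want to match are more than 4000 characters long and would look ridiculous when pasted. Instead, I will post two sequences that are representative of my problem, which should suffice.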

seq = "---A-A--AA---A--"
newseq = "AA---A--A-----A-----"
seq = list(seq)  # change master sequence from string to list
newseq = list(newseq)  # change new sequence from string to list
n = len(seq)  # length of master sequence
newseq.extend('.')  # add a tag to the end of the new sequence to account for terminal gaps

print(seq, newseq, n)  # verify sequences in list form and length

for i in range(n):
    if seq[i] != newseq[i]:
        if seq[i] != '-':  # gap deletion
            del newseq[i]
        elif newseq[i] != '-':
            newseq.insert(i, '-')  # gap insertion
        elif newseq[i] == '-':
            del newseq[i]

old = ''.join(seq)  # change list back to string
new = ''.join(newseq)  # change list back to string
new = new.strip('.')  # remove tag

print(old)  # verify master-sequence fidelity
print(new)  # verify matching sequence

The output I get from this particular code and set of sequences is:

---A-A--AA---A--

---A-A--A----A-----A-----

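I can't seem to get the loop to delete the unwanted dashes between characters more than once per position, because the remaining loop iterations get used up in add-dash/delete-dash pairs.
This is a good start on the problem here.

How can I write this loop so that I get my desired output (two identical sequences)?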


2012-06-13 AHuck

There is no loop in this code example. –

Thanks for pointing that out! I think I lost the loop statement in the shuffle. – AHuck

I edited your code, and it now gives the correct output:

seq = "----AB--C-D-----"
newseq = "--A--BC---D-"
seq = list(seq)  # change master sequence from string to list
newseq = list(newseq)  # change new sequence from string to list
n = len(seq)  # length of master sequence
newseq.extend('.')  # add a tag to the end of the new sequence to account for terminal gaps

print(seq, newseq, n)  # verify sequences in list form and length
for i in range(len(seq)):
    if seq[i] != newseq[i]:
        # every branch inserts the master's character, so position i always ends up matching
        if seq[i] == '-':
            newseq.insert(i, '-')
        elif newseq[i] == '-':
            newseq.insert(i, seq[i])
        else:
            newseq.insert(i, seq[i])
else:
    # for-else: runs once the loop finishes; trim newseq to the master's length
    newseq = newseq[0:len(seq)]

old = ''.join(seq)  # change list back to string
new = ''.join(newseq)  # change list back to string
new = new.strip('.')  # remove tag

print(old)  # verify master-sequence fidelity
print(new)  # verify matching sequence

Output:

----AB--C-D----- 
----AB--C-D-----

And for AA---A--A-----A-----:

---A-A--AA---A-- 
---A-A--AA---A--
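
The loop in the question advances `i` on every iteration, so a run of several unwanted gaps loses at most one gap per position. One possible way around that, sketched here as an assumption rather than as the method used in this answer, is a `while` loop that only advances once position `i` agrees with the master. The function name `realign_in_place` and its error handling are hypothetical, and the sketch assumes both sequences contain the same residues in the same order.

```python
def realign_in_place(master, query, gap="-"):
    """Insert/delete gaps in `query` until it matches `master`.

    Assumption (not checked up front): both strings hold the same
    residues in the same order; a ValueError is raised otherwise.
    """
    new = list(query)
    i = 0
    while i < len(master):
        if i < len(new) and new[i] == master[i]:
            i += 1                      # position agrees, move on
        elif master[i] == gap:
            new.insert(i, gap)          # master has a gap here: add one
            i += 1
        elif i < len(new) and new[i] == gap:
            del new[i]                  # extra gap: delete it and re-check the same i
        else:
            raise ValueError("residues differ at master position %d" % i)
    # anything left beyond the master's length must be terminal gaps
    if any(c != gap for c in new[len(master):]):
        raise ValueError("query has residues left over after the master ends")
    return "".join(new[:len(master)])


print(realign_in_place("---A-A--AA---A--", "AA---A--A-----A-----"))
```

Because the deletion branch does not increment `i`, the same position is re-examined until all surplus gaps in front of the next residue are gone.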


2012-06-13 16:31:48

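This algorithm is not the same as the previous one: it considers possible mismatches at specific positions between strings of different sizes, and it does not backtrack if a better solution appears later. Please consider studying dynamic programming. – rlinden

I will definitely pursue dynamic programming for future work. For my immediate purposes, though, this code works in general (the sequences are always in the same order, there is only one solution, and this code works for strings of different sizes). Thanks! – AHuck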
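
Under the assumption stated in the comment above (same residues, same order, a single solution), the realignment can also be sketched without any in-place insertion or deletion: strip the gaps from the new sequence and thread its residues onto the master's gap pattern. The helper name `thread_onto_master` is hypothetical, and this is only a sketch of that idea, not code from the thread.

```python
def thread_onto_master(master, query, gap="-"):
    """Place the residues of `query` into the non-gap columns of `master`.

    Assumption: `query` has exactly as many residues as `master`
    has non-gap columns; otherwise a ValueError is raised.
    """
    residues = [c for c in query if c != gap]
    slots = sum(1 for c in master if c != gap)
    if len(residues) != slots:
        raise ValueError("expected %d residues, found %d" % (slots, len(residues)))
    it = iter(residues)
    # keep the master's gaps, fill every other column with the next residue
    return "".join(gap if c == gap else next(it) for c in master)


print(thread_onto_master("----AB--C-D-----", "--A--BC---D-"))
```

Note that this only compares residue counts; it does not check that the residues themselves are identical to the master's.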

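The sequence alignment problem is well known, and its solutions are well described. For an introductory text, see Wikipedia. The best solution I know of involves dynamic programming, and you can see a sample implementation in Java at this site.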
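
For readers working in Python, the dynamic-programming idea can be sketched with a basic global alignment in the style of Needleman-Wunsch. This is not the Java implementation linked above; the function name and the scoring values (`match=1`, `mismatch=-1`, `gap_penalty=-1`) are assumptions chosen for illustration, and real protein work would normally use a substitution matrix instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap_penalty=-1, gap="-"):
    """Globally align ungapped strings `a` and `b`; return (a_aln, b_aln, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = j * gap_penalty

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap_penalty,     # gap in b
                score[i][j - 1] + gap_penalty,     # gap in a
            )

    # trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap_penalty:
            out_a.append(a[i - 1]); out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(needleman_wunsch("ABCD", "ABD"))
```

Unlike the position-by-position loops earlier in the thread, the table lets the traceback pick the best overall placement of gaps, at the cost of O(n·m) time and memory, which is worth keeping in mind for sequences longer than 4000 characters.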


2012-06-13 16:30:32 rlinden
