對齊Python中的DNA序列

我有成千上萬的DNA序列，範圍在100到5000 bp之間，我需要對齊並計算指定對的身份分數。 Biopython pairwise2做的不錯，但只適用於短序列，並且當序列大小大於2kb時，即使使用'score_only'和'one_alignment_only'選項，也會顯示嚴重的內存泄漏，導致'MemoryError'！對齊Python中的DNA序列

whole_coding_scores={} 
from Bio import pairwise2 
for genes in whole_coding: # whole coding is a <25Mb dict providing DNA sequences 
    alignment=pairwise2.align.globalxx(whole_coding[genes][3],whole_coding[genes][4],score_only=True,one_alignment_only=True) 
    whole_coding_scores[genes]=alignment/min(len(whole_coding[genes][3]),len(whole_coding[genes][4]))

結果從超級計算機返回：

Max vmem   = 256.114G #Memory usage of the script 
failed assumedly after job because: 
job 4945543.1 died through signal XCPU (24)

我知道有其他工具進行比對，但他們主要是可以只寫在輸出文件中的得分，需要加以檢索再次讀取和分析並使用對齊分數。是否有任何工具可以對齊序列並返回python環境中的對齊分數，因爲pairwise2會但沒有內存泄漏？

來源

2015-11-10 user3015703

對於全局比對可以嘗試NWalign https://pypi.python.org/pypi/nwalign/。我沒有使用它，但似乎可以恢復腳本中的對齊分數。

否則也許EMBOSS工具可以幫助：http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

來源

2015-11-10 05:50:50 jpalmer

不幸的是NWalign不適用於Python3！ – user3015703

首先，我用BioPython的針這一點。一個很好的HOWTO（忽略傳統設計:-)）可以發現here

二：也許你能避免使用發電機獲得一整套到內存？我不知道你的'whole_coding'對象來自哪裏。但是，如果它是一個文件，請確保您不讀取整個文件，然後遍歷內存對象。例如：

whole_coding = open('big_file', 'rt').readlines() # Will consume memory

但

for gene in open('big_file', 'rt'):  # will not read the whole thing into memory first 
    process(gene)

如果您需要處理，你可以寫一個發電機funtion：

def gene_yielder(filename): 
    for line in open('filename', 'rt'): 
     line.strip() # Here you preprocess your data 
     yield line  # This will return

然後

for gene in gene_yielder('big_file'): 
    process_gene(gene)

基本上，你希望你的程序充當管道：t它們流過它，並得到處理。準備肉湯時不要用它作爲烹飪鍋：添加一切，並加熱。我希望這個比較不是太牽強:-)

來源

2015-11-10 07:41:27 Ward

實際上，似乎沒有任何解決方案，沒有寫入和讀取數據的Python。我還用Biopython針解決了這個問題，但是一些額外的工作卻完成了工作。第二，整個字典是一個<20Mb的字典，它在前面的步驟中進行處理，只是提供序列，沒有相當多的內存消耗！ – user3015703

Biopython可以（現在）。 Biopython版本中的pairwise2模塊。 1.68（更快）並且可能需要更長的序列。下面是新的和舊的pairwise2的比較（在32位Python 2.7.11中，具有2 GB內存限制，64位Win7，Intel Core i5,2。8千兆赫）：

from Bio import pairwise2 

length_of_sequence = ... 
seq1 = '...'[:length_of_sequence] # Two very long DNA sequences, e.g. human and green monkey 
seq2 = '...'[:length_of_sequence] # dystrophin mRNAs (NM_000109 and XM_007991383, >13 kBp) 
aln = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -1)

舊pairwise2
- 最大長度/時間：〜1900個字符/ 10秒
新pairwise2
- 最大長度/時間：〜7450個字符/ 12秒
- 時間1900個字符：1秒

隨着score_only集到True，新pairwise2可以在6秒做的兩個序列〜8400個字符。

來源

2016-03-13 12:26:14 Markus

對齊Python中的DNA序列

回答

相關問題