我的python代碼提取基於tblastn命中的核苷酸序列太慢

我做了多個查詢tblastn搜索使用獨立的爆炸在我的本地數據庫從python腳本和輸出是一個大的XML文件。我需要解析輸出只取前四個命中，因爲一些查詢返回的命中次數超過四次。但是由於輸出是作爲氨基酸序列比對記錄給出的，因此對於文件中的每個tblastn記錄，解析hsps開始和結束的原始核苷酸序列的命中對象的高評分對是很重要的。我的python代碼提取基於tblastn命中的核苷酸序列太慢

所以我有這個代碼，但它很痛苦地放緩，並且考慮到它可以花費超過一個月的時間來完成它正在做的事情。有人可以幫助我改進替代方案嗎？

> from Bio import SeqIO 
> from Bio.Blast import NCBIXML 
> 
> infile_path = '/home/edson/ungulate/ungulate.fa' # this is a file 
> which contain unaligned nucleotide sequences outfile_path = 
> '/home/edson/ungulate/tblastn_result.fa' 
> 
> for seq_record in SeqIO.parse(infile_path, 'fasta'): 
>  flag = seq_record.description   # a flag is sequence identifier in a fasta file format 

     with open(outfile_path, 'a') as outfile: 
      with open('/home/edson/ungulate/tblastn_result.xml') as tblastn_file: 
      tblastn_records = NCBIXML.parse(tblastn_file) 
      for tblastn_record in tblastn_records: 
       for alignment in tblastn_record.alignments[:4]: 
        for hsp in alignment.hsps: 
         if flag in alignment.title:   
          # this cross check if sequence identifier is present in an XML file 
>       sub_record = seq_record.seq[hsp.sbjct_start:hsp.sbjct_end] 
          # this takes sequences in an infile path and slice them based on tblastn output 
>       outfile.write('>' + seq_record.description + '\n') 
>       outfile.write(str(sub_record + '\n'))

謝謝。

來源

2013-12-20 user3014974

爲什麼大於號？你是否點擊了報價按鈕，然後是代碼按鈕？ – user2357112

「tblastn」和「alignment.hsps」（典型）的「len」是什麼？你的程序哪一部分很慢？ – goncalopp

您分配'flag'的循環會在第一次之後的每次迭代中覆蓋'flag'的舊值。你想要一個標誌列表嗎？或者'for'循環只有一次迭代？ – user2357112

至少有兩個明顯的瓶頸 - 對每個迭代外環，你

重新outfile
重新和重新解析tblastn_file

只是移動這些外循環之外的操作應該會產生一些明顯的性能改進（如果你有多次外循環迭代）當然是op）。

另一種可能的改進：您在alignment.hsps上每次迭代測試flag in alignment.title。本次測試將是所有hsps常數相同hsps所以不如之前所說的那樣，即：

for alignment in tblastn_record.alignments[:4]: 
    if flag in alignment.title: 
     for hsp in alignment.hsps: 
      # etc...

來源

2013-12-20 11:27:03

我的python代碼提取基於tblastn命中的核苷酸序列太慢

回答

相關問題