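A faster way in Python

I wrote this block of code to get tons of BLAST results, but it seems a bit slow because I use two 'for' loops to iterate over two files. So I would like to know whether there is a faster, greedier way to narrow down the range of iteration.

The code is below: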

import os
from Bio import SeqIO
from Bio.Blast.Applications import NcbiblastnCommandline

for tf_line in SeqIO.parse('deneme2.txt', 'fasta'):
    # Header fields: read id, transcript id, start, end
    tf_line.description = tf_line.description.split()
    # Slow part: the whole cDNA file is re-parsed for every single read
    for cd_line in SeqIO.parse('Mus_musculus.GRCm38.74.cdna.all.fa', 'fasta'):
        if cd_line.id == tf_line.description[1]:
            # Open the temp file only on a match so it is always closed again
            tempfile = open('tempfile.txt', 'w')
            tempfile.write('>' + cd_line.id + '\n' +
                           str(cd_line.seq)[int(tf_line.description[2]) - 100:
                                            int(tf_line.description[3]) + 100])
            tempfile.close()
            os.system('makeblastdb -in tempfile.txt -dbtype nucl '
                      '-out tempfile.db -title \'tempfile\'')
            cline = NcbiblastnCommandline(query='SRR029153.fasta',
                                          db="tempfile.db",
                                          outfmt=7,
                                          out=(tf_line.description[0] + ' ' +
                                               tf_line.description[1]))
            stdout, stderr = cline()

'deneme.txt' is 30 MB and looks like this:

SRR029153.93098 ENSMUST00000103567 999 1147 TCAGGCCAAGTTTCTCTC

SRR029153.83280 ENSMUST00000181483 151 425 CAGGTTGAC

SRR029153.108993 ENSMUST00000184883 174 1415 TGGCACCTTTGC .....

The 'Mus_musculus.GRCm38.74.cdna.all.fa' file is 170 MB and looks like this:

ENSMUST00000181483 ACACTGAAGAT .....

ENSMUST00000184883 ATCTTTTTTCTTTCAGGG .....

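The 'Mus_musculus.GRCm38.74.cdna.all.fa' file contains sequence IDs (ENSMUST...), and I have to find the matches between 'deneme.txt' and 'Mus_musculus.GRCm38.74.cdna.all.fa'.

This ought to take 4-5 hours, but with this code it takes at least 10 hours.

Any help would be appreciated, because I have to get away from brute-force algorithms like this one and be greedy. Thanks.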

Source

2014-02-19 mehmet

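How big is 'Mus_musculus.GRCm38.74.cdna.all.fa'? It seems that, instead of reading it again for every match, you could cache its data in a hash structure (keyed by id) before parsing 'deneme2.txt', and then look up 'tf_line.description[1]' in it? – ernie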
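
A minimal sketch of the caching idea from the comment above, using only the standard library so it runs without Biopython. The helper name `index_fasta` and the miniature records are made up for illustration. With Biopython installed, `SeqIO.to_dict` or `SeqIO.index` can build a similar id-keyed lookup.

```python
def index_fasta(lines):
    """Build {record_id: sequence} from an iterable of FASTA lines."""
    index = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Store the record that just ended before starting a new one.
            if current_id is not None:
                index[current_id] = ''.join(chunks)
            # The ID is the first whitespace-delimited token of the header.
            header = line[1:].split()
            current_id = header[0] if header else ''
            chunks = []
        else:
            chunks.append(line)
    if current_id is not None:
        index[current_id] = ''.join(chunks)
    return index


# Made-up miniature stand-in for the 170 MB cDNA file.
cdna_lines = [
    '>ENSMUST00000181483 cdna:example',
    'ACACTGAAGAT',
    'GGCCTTAA',
    '>ENSMUST00000184883 cdna:example',
    'ATCTTTTTTCTTTCAGGG',
]
cdna_index = index_fasta(cdna_lines)

# Each lookup is now a single hash access instead of a scan of the file.
print(cdna_index['ENSMUST00000181483'])
```

The file is read once, and every later match is a dictionary lookup. The trade-off is that all sequences are held in memory at the same time.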

General advice: if you have not profiled your script yet, please do so, so that you know which commands take the most time. (Just run 'python -m cProfile myscript.py'.) – Carsten
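
The same profiler can also be driven from inside a script. The sketch below is only an illustration: `build_ids`, `find_matches` and `main` are hypothetical stand-ins for the parsing and matching steps, not the code from the question.

```python
import cProfile
import io
import pstats


def build_ids(count):
    """Stand-in for the parsing step: produce a list of made-up IDs."""
    return ['ID{:05d}'.format(n) for n in range(count)]


def find_matches(ids, wanted):
    """Stand-in for the matching step: 'in' on a list scans it linearly."""
    return [w for w in wanted if w in ids]


def main():
    ids = build_ids(3000)
    return find_matches(ids, ids[::10])


# Profile a single call of main() and keep its return value.
profiler = cProfile.Profile()
matches = profiler.runcall(main)

# Write the ten most expensive entries (by cumulative time) into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(10)
report = buffer.getvalue()
print(report)
```

The report lists each function with its call count and cumulative time, which shows directly whether parsing or matching dominates.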

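Help others help you. Describe the task in plain English (that might allow an algorithm with better time complexity): what is the input? How big is it? What is the expected result? *Measure* the time performance. Provide [self-contained benchmark code](http://stackoverflow.com/help/mcve) that others can try (it allows testing both correctness and time performance). Set a goal: how fast is fast enough. – jfs

I think this still produces the same BLAST results but should be a lot faster. Read the comments in the code for further optimization: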
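
As an illustration of what such a self-contained benchmark might look like, here is a small sketch that checks correctness and measures time for two ways of matching IDs. The data is synthetic and the function names are hypothetical, so the timings say nothing about the real files.

```python
import time


def match_by_scanning(read_targets, transcript_ids):
    """Nested loops: scan every transcript ID for every read target."""
    matches = []
    for target in read_targets:
        for tid in transcript_ids:
            if tid == target:
                matches.append(tid)
                break
    return matches


def match_by_lookup(read_targets, transcript_ids):
    """Build a set once, then test membership with a hash lookup."""
    known = set(transcript_ids)
    return [target for target in read_targets if target in known]


# Synthetic input: 2000 transcript IDs, every fifth one is a read target.
transcript_ids = ['T{:06d}'.format(n) for n in range(2000)]
read_targets = transcript_ids[::5]

results = {}
for func in (match_by_scanning, match_by_lookup):
    start = time.perf_counter()
    results[func.__name__] = func(read_targets, transcript_ids)
    elapsed = time.perf_counter() - start
    print('{0}: {1:.4f} s'.format(func.__name__, elapsed))

# Correctness check: both strategies must return the same matches.
print(results['match_by_scanning'] == results['match_by_lookup'])
```

Scanning costs time proportional to the number of reads times the number of transcripts, while the set-based version costs roughly their sum.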

import os
from Bio import SeqIO
from Bio.Blast.Applications import NcbiblastnCommandline

# Map transcript id -> (read id, start, end), parsed once from deneme2.txt.
# Assumes four whitespace-separated header fields, as in the question's code.
# Note: if several reads name the same transcript, only the last one is kept.
tf_data = {transcript_id: (read_id, int(val1), int(val2))
           for read_id, transcript_id, val1, val2 in
           (line.description.split() for line in
            SeqIO.parse('deneme2.txt', 'fasta'))}

for cd_line in SeqIO.parse('Mus_musculus.GRCm38.74.cdna.all.fa', 'fasta'):
    if cd_line.id in tf_data:
        read_id, tf_val1, tf_val2 = tf_data[cd_line.id]

        # If it is likely that the same tf_data record is used many times,
        # move the math (-100 / +100) into the dict comprehension above. If,
        # on the other hand, most records in tf_data are unlikely to be used,
        # move the int casts back down to the slice below.
        with open('tempfile.txt', 'w') as tempfile:
            tempfile.write('>{0}\n{1}'.format(
                cd_line.id,
                str(cd_line.seq)[tf_val1 - 100: tf_val2 + 100]))

        os.system('makeblastdb -in tempfile.txt -dbtype nucl '
                  '-out tempfile.db -title \'tempfile\'')
        cline = NcbiblastnCommandline(
            query='SRR029153.fasta',
            db="tempfile.db",
            outfmt=7,
            # Same output name as in the question: "<read id> <transcript id>"
            out="{0} {1}".format(read_id, cd_line.id))

        # stdout and stderr are not used, so do not assign them
        cline()
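
One detail of the slice `tf_val1 - 100: tf_val2 + 100` is worth checking: when the start position is below 100, the left index becomes negative and Python counts it from the end of the sequence. The sketch below shows the effect and one possible guard. The helper name `flanked_slice` is made up, and whether clamping at 0 is the intended behaviour is an assumption.

```python
def flanked_slice(sequence, start, end, flank=100):
    """Return sequence[start - flank:end + flank], clamping the left edge at 0."""
    left = max(0, start - flank)
    # A right index past the end is harmless: slicing simply stops at the end.
    return sequence[left:end + flank]


seq = 'ACGT' * 100  # 400 made-up bases

# start=50 gives a left index of -50, i.e. position 350, which lies after the
# right index 220, so the naive slice is empty.
naive = seq[50 - 100:120 + 100]
clamped = flanked_slice(seq, 50, 120)

print(len(naive), len(clamped))  # prints: 0 220
```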

Source

2014-02-19 22:34:30 deinonychusaur

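Thank you very much. This is a nice way to separate the problem and get rid of all the unnecessary iterations. – mehmet

You could speed up the first loop by parsing the file manually with regular expressions, but I think most of the time is now spent on the BLAST runs. So to improve further you would need multiprocessing. Anyway, if you are happy with the solution, you should accept it ;) – deinonychusaur

Parsing is not the problem here; the problem is finding the proper matches, and using a dictionary will speed up the code. Multiprocessing is another matter :) Thanks again for your time. – mehmet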
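
A rough sketch of how the independent BLAST runs might be spread over several workers. It uses a thread pool from `concurrent.futures` rather than the `multiprocessing` module, on the assumption that each job mostly waits for an external process. `run_job` is a hypothetical placeholder that only builds the output name, and no BLAST tool is called here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job):
    """Hypothetical placeholder for one makeblastdb + blastn run."""
    read_id, transcript_id = job
    # A real version would write its own uniquely named temp FASTA file and
    # database here (a shared 'tempfile.txt' would be overwritten by parallel
    # jobs) and then launch the external tools.
    return '{0} {1}'.format(read_id, transcript_id)


def run_all(jobs, workers=4):
    """Run all jobs on a pool of worker threads and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the same order as the input jobs.
        return list(pool.map(run_job, jobs))


jobs = [
    ('SRR029153.93098', 'ENSMUST00000103567'),
    ('SRR029153.83280', 'ENSMUST00000181483'),
]
outputs = run_all(jobs)
print(outputs)
```

The important design point is that parallel jobs must not share file names, which the single-file approach in the thread would need to change first.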
