2012-05-30

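Splitting a large fasta into multiple files, unable to name them with GI number

I should start by saying that I am new to both Python and Biopython. I am trying to split a large .fasta file (containing multiple entries) into individual files, each holding a single entry. I found most of the following code on the Biopython wiki/Cookbook site and modified it slightly. My problem is that this generator names the files "1.fasta", "2.fasta", etc., and I need them named by some identifier (for example the GI number).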

def batch_iterator(iterator, batch_size) : 
    """Returns lists of length batch_size. 

    This can be used on any iterator, for example to batch up 
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch 
    Alignment objects from Bio.AlignIO.parse(...), or simply 
    lines from a file handle. 

    This is a generator function, and it returns lists of the 
    entries from the supplied iterator. Each list will have 
    batch_size entries, although the final list may be shorter. 
    """ 
    entry = True  # Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = next(iterator)
            except StopIteration :
                entry = None
            if entry is None :
                # End of file
                break
            batch.append(entry)
        if batch :
            yield batch

from Bio import SeqIO 
infile = input('Which .fasta file would you like to open? ') 
record_iter = SeqIO.parse(open(infile), "fasta") 
for i, batch in enumerate(batch_iterator(record_iter, 1)) : 
    outfile = "c:\python32\myfiles\%i.fasta" % (i+1) 
    handle = open(outfile, "w") 
    count = SeqIO.write(batch, handle, "fasta") 
    handle.close() 

If I try to replace:

outfile = "c:\python32\myfiles\%i.fasta" % (i+1) 

with:

outfile = "c:\python32\myfiles\%s.fasta" % (record_iter.id) 

so that the files are named with something like seq_record.id from SeqIO, I get the following error:

Traceback (most recent call last): 
    File "C:\Python32\myscripts\generator.py", line 33, in <module>
    outfile = "c:\python32\myfiles\%s.fasta" % (record_iter.id) 
AttributeError: 'generator' object has no attribute 'id' 

Given that the generator object has no attribute "id", can I work around this somehow? Is this script too complicated for what I am trying to do?!? Thanks, Charles

Answers

Since you only want one record at a time, you can ditch the batch_iterator wrapper and the enumerate:

for seq_record in record_iter: 

Then what you want is the id attribute of each record, not of the iterator as a whole:

for seq_record in record_iter:
    # Raw string so the backslashes in the Windows path are kept literally
    outfile = r"c:\python32\myfiles\{0}.fasta".format(seq_record.id)
    handle = open(outfile, "w")
    count = SeqIO.write(seq_record, handle, "fasta")
    handle.close()
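
One caveat, sketched here under an assumption about the input: if the FASTA headers follow the older NCBI style gi|<number>|<db>|<accession>|, then seq_record.id contains "|" characters, which Windows does not allow in file names. The helper gi_from_id below is hypothetical (it is not part of Biopython); it pulls out just the GI number and otherwise falls back to a filename-safe version of the whole id.

```python
import re


def gi_from_id(record_id):
    """Return the GI number from an NCBI-style id, else a filename-safe id.

    Assumes ids look like 'gi|6273291|gb|AF191665.1|AF191665'. That layout
    is an assumption about the input file, not something Biopython enforces.
    """
    fields = record_id.split("|")
    # In the assumed layout the GI number is the field right after 'gi'.
    if len(fields) >= 2 and fields[0] == "gi" and fields[1].isdigit():
        return fields[1]
    # Fallback: replace characters that are invalid in Windows file names.
    return re.sub(r'[\\/:*?"<>|]', "_", record_id)


print(gi_from_id("gi|6273291|gb|AF191665.1|AF191665"))
print(gi_from_id("AF191665.1"))
```

In the loop above, the file name would then be built from gi_from_id(seq_record.id) instead of seq_record.id.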

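For your reference, the generator error comes from trying to read the attribute id from the object record_iter. record_iter is not a single record but a series of records held as a Python generator. A generator is like a list in progress: records are produced one at a time, so you do not have to read the whole file at once and memory use is more efficient. More about generators: What can you use Python generator functions for?, http://docs.python.org/tutorial/classes.html#generators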
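
A minimal sketch of that distinction, using a stand-in Record class and a fake_parse generator. Both names are hypothetical and only mimic the kind of objects SeqIO.parse yields; no Biopython is needed to run it.

```python
class Record:
    """Hypothetical stand-in for a SeqRecord; it only carries an id."""

    def __init__(self, id):
        self.id = id


def fake_parse(ids):
    """Generator that yields one Record at a time, like a parser over a file."""
    for record_id in ids:
        yield Record(record_id)


record_iter = fake_parse(["seq1", "seq2"])

# The generator itself has no 'id'; only the records it yields do.
print(hasattr(record_iter, "id"))

seen_ids = [seq_record.id for seq_record in record_iter]
print(seen_ids)

# A generator is consumed as it is iterated, so a second pass yields nothing.
print(list(record_iter))
```

The first print shows that the attribute is missing on the generator, which is the situation behind the AttributeError in the question.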

This seems like the best and simplest approach. Opening the output file would be cleaner with 'with open(outfile, "w") as handle:' – weronika
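
A small sketch of that suggestion. It writes plain FASTA text into a temporary directory so that it runs without any input file; the record content and the file name are made up for illustration.

```python
import os
import tempfile

outdir = tempfile.mkdtemp()
outfile = os.path.join(outdir, "example.fasta")

# The with-block closes the file even if writing raises an exception,
# so no explicit handle.close() is needed.
with open(outfile, "w") as handle:
    handle.write(">example\nACGT\n")

print(handle.closed)
```

With Biopython, the body of the with-block would be the SeqIO.write(seq_record, handle, "fasta") call from the answer.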

Thanks for the help, everyone! – user1426421

Alternatively, instead of opening the file in your own code, let Biopython do it: count = SeqIO.write(seq_record, outfile, "fasta") – peterjc
