Splitting a large FASTA file into multiple files: unable to name them by GI number

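I should start by saying that I am new to both Python and Biopython. I am trying to split a large .fasta file (containing multiple entries) into individual files, each holding a single entry. I found most of the following code on the Biopython wiki/Cookbook site and modified it slightly. My problem is that this generator names the files "1.fasta", "2.fasta" and so on, and I need them named after some identifier, such as the GI number.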

def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

from Bio import SeqIO 
infile = input('Which .fasta file would you like to open? ') 
record_iter = SeqIO.parse(open(infile), "fasta") 
for i, batch in enumerate(batch_iterator(record_iter, 1)) : 
    outfile = "c:\python32\myfiles\%i.fasta" % (i+1) 
    handle = open(outfile, "w") 
    count = SeqIO.write(batch, handle, "fasta") 
    handle.close()

If I try to replace:

outfile = "c:\python32\myfiles\%i.fasta" % (i+1)

with:

outfile = "c:\python32\myfiles\%s.fasta" % (record_iter.id)

so that the files are named after something like seq_record.id, as in the SeqIO examples, I get the following error:

Traceback (most recent call last): 
    File "C:\Python32\myscripts\generator.py", line 33, in <module>
    outfile = "c:\python32\myfiles\%s.fasta" % (record_iter.id) 
AttributeError: 'generator' object has no attribute 'id'

Even though the generator has no attribute 'id', can I work around this somehow? Is this script too complicated for what I want to do?!? Thanks, Charles

Source

2012-05-30 user1426421

Since you only want one record at a time, you can ditch the batch_iterator wrapper and the enumerate:

for seq_record in record_iter:

Then what you want is the id attribute of each record, not of the iterator as a whole:

for seq_record in record_iter:
    # Raw string, so the backslashes in the Windows path are taken literally
    outfile = r"c:\python32\myfiles\{0}.fasta".format(seq_record.id)
    handle = open(outfile, "w")
    count = SeqIO.write(seq_record, handle, "fasta")
    handle.close()
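One caveat when naming files by GI number: with NCBI-style headers, seq_record.id is usually the whole pipe-delimited string (for example gi|6273291|gb|AF191665.1|AF191665), and | is not allowed in Windows file names. Below is a minimal sketch of pulling out just the GI number, assuming that legacy header layout. gi_from_id is a hypothetical helper written for illustration, not part of Biopython.

```python
def gi_from_id(record_id):
    """Return the GI number from an NCBI-style id, or a filename-safe fallback.

    Assumes the legacy 'gi|<number>|<db>|<accession>|' layout; gi_from_id is
    a hypothetical helper for illustration, not a Biopython function.
    """
    fields = record_id.split("|")
    if len(fields) >= 2 and fields[0] == "gi" and fields[1].isdigit():
        return fields[1]
    # No GI prefix: keep the id, but swap out characters Windows rejects.
    return "".join("_" if ch in '\\/:*?"<>|' else ch for ch in record_id)


print(gi_from_id("gi|6273291|gb|AF191665.1|AF191665"))  # 6273291
print(gi_from_id("AF191665.1"))  # AF191665.1
```

In the loop above, the file name would then be built from gi_from_id(seq_record.id) rather than from seq_record.id directly.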

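FYI, the generator error comes from the fact that you are trying to get the attribute id from the object record_iter. record_iter is not a single record but a set of records, held as a Python generator. A generator is something like a list in progress: you do not have to read the whole file in at once, so memory use is more efficient. More on generators: What can you use Python generator functions for?, http://docs.python.org/tutorial/classes.html#generators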
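The difference between the generator and the records it yields can be seen without Biopython at all. In this small sketch, Record and fake_parse are made-up stand-ins for SeqRecord and SeqIO.parse, used only so the example runs on its own:

```python
from collections import namedtuple

# Stand-in for Bio.SeqRecord, so the sketch runs without Biopython installed.
Record = namedtuple("Record", ["id", "seq"])


def fake_parse():
    """Yield records one at a time (hypothetical stand-in for SeqIO.parse)."""
    yield Record("seq1", "ACGT")
    yield Record("seq2", "GGCC")


record_iter = fake_parse()

# The generator itself has no 'id'; only the records it yields do.
print(hasattr(record_iter, "id"))  # False

ids = [record.id for record in record_iter]
print(ids)  # ['seq1', 'seq2']

# A generator is consumed as it is read: a second pass yields nothing.
print(list(record_iter))  # []
```

The last line is worth keeping in mind: once the loop has run over record_iter, the file has to be parsed again to get the records a second time.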

Source

2012-05-30 16:18:56 Karmel

Seems like the best and simplest way. Opening the output file would be cleaner with 'with open(outfile, "w") as handle:' – weronika

Thanks for the help, everyone! – user1426421

Or, instead of calling open in your own code, let Biopython do it: count = SeqIO.write(seq_record, outfile, "fasta") – peterjc
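Putting the answer and the comments together, the whole script might look something like the sketch below. split_fasta and its arguments are made-up names for illustration. The sketch assumes Biopython is installed, that every record id is unique, and that the ids are usable as file names, which may not hold for pipe-delimited NCBI ids.

```python
import os

from Bio import SeqIO


def split_fasta(infile, outdir):
    """Write each record of infile to outdir/<record id>.fasta.

    Hypothetical helper; returns the list of paths written.
    """
    written = []
    # SeqIO.parse accepts a file name and yields one SeqRecord at a time.
    for seq_record in SeqIO.parse(infile, "fasta"):
        outfile = os.path.join(outdir, "{0}.fasta".format(seq_record.id))
        # 'with' closes the handle even if the write raises.
        with open(outfile, "w") as handle:
            SeqIO.write(seq_record, handle, "fasta")
        written.append(outfile)
    return written
```

Following the last comment, the two lines of the with block could also be replaced by a single SeqIO.write(seq_record, outfile, "fasta"), since SeqIO.write accepts a file name as well as a handle.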
