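Biopython SeqIO to Pandas DataFrame
I have a FASTA file that can easily be parsed by SeqIO.parse.
I am interested in extracting the sequence IDs and sequence lengths. I did it with the lines below, but I feel this is way too heavy (two iterations, conversions, etc.):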
from Bio import SeqIO
import pandas as pd
# Parse the FASTA file (note: the file is read twice, once per list)
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
# Convert the lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
# Gather the Series into a pandas DataFrame and use the ID column as the index
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])
I could do it with only one iteration, but then I get a dict:
records = SeqIO.parse(fastaFile, 'fasta')
and somehow I can't get DataFrame.from_dict to work...
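For what it's worth, SeqIO.parse returns an iterator of SeqRecord objects rather than a dict, which may be part of why DataFrame.from_dict does not cooperate. A minimal sketch of the dict route follows, assuming a tiny inline FASTA (the records seq1 and seq2 are made up for illustration) and using SeqIO.to_dict to build the dict first. Note that SeqIO.to_dict keeps every record in memory, so this is probably only reasonable for small files.

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO

# Hypothetical two-record FASTA, inlined so the sketch runs without a file on disk.
fasta_text = ">seq1 first example\nACGTACGT\n>seq2 second example\nACGTT\n"

# SeqIO.to_dict consumes the iterator and returns {record.id: SeqRecord}.
records = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

# Reduce each record to its length, then let from_dict use the keys as the index.
lengths = {seq_id: len(record.seq) for seq_id, record in records.items()}
by_id = pd.DataFrame.from_dict(lengths, orient="index", columns=["length"])
by_id = by_id.rename_axis("ID")

print(by_id)
```

The orient="index" argument is what turns the dict keys into row labels; with the default orientation the keys would be treated as column names instead.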
My goal is to iterate over the FASTA file once and collect the IDs and sequence lengths into a DataFrame as I go.
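One way that goal could be sketched is below: a single pass that keeps only the (ID, length) pairs and hands them to the DataFrame constructor. The helper name fasta_lengths and the inline sample records are assumptions for illustration, not part of the original file; in practice the handle would be the path or open file for sequence.fasta.

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO

# Hypothetical two-record FASTA; the second sequence is wrapped over two lines.
fasta_text = ">seq1 first example\nACGTACGT\n>seq2 second example\nACG\nTT\n"


def fasta_lengths(handle):
    """Build a DataFrame of sequence lengths indexed by ID in one pass (hypothetical helper)."""
    # Only the ID and the length are kept per record, not the sequence itself.
    rows = [(record.id, len(record.seq)) for record in SeqIO.parse(handle, "fasta")]
    return pd.DataFrame(rows, columns=["ID", "length"]).set_index("ID")


Qfasta = fasta_lengths(StringIO(fasta_text))
print(Qfasta)
```

Because each SeqRecord is discarded as soon as its length has been taken, memory use should scale with the number of records rather than with the total sequence length.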
Here is a short FASTA file for those who want to help.
Thanks David, that did it. Thanks also for the "memory" comment! – Sara