2014-12-26 26 views
1

如果基因名稱存儲在文本文件中,使用biopython如何從fasta文件中剪切我感興趣的基因?python - 從fasta文件中選擇性地選擇核苷酸序列?

#extract genes     
f1 = open('ortholog1.txt','r') 
f2 = open('all.fasta','r') 
f3 = open('ortholog1.fasta','w') 

genes = [line.rstrip('\n') for line in f1.readlines()] 


i=0 
for seq_record in SeqIO.parse(f2, "fasta"): 
    if genes[i] == seq_record.id:    
     print genes[i] 
     f3.write('>'+genes[i]) 
     i=i+1 
     if i==18: 
      break 
     f3.write('\n') 
     f3.write(str(seq_record.seq)) 
     f3.write('\n') 


f2.close() 
f3.close() 

我正在嘗試上面的代碼。但它有一些錯誤,不是通用的,因爲像ortholog1.txt(其中包含基因名稱)還有5個更類似的文件。此外,每個文件中的基因數量也不盡相同(並非始終如此)。這裏all.fasta是包含所有基因的文件。 ortholog1.fasta必須包含剪切的核苷酸序列。

回答

1

基本上,你可以讓Biopython做所有的工作。

我會猜測「ortholog1.txt」中的基因名稱與fasta文件中的基因名稱完全相同,並且每行有一個基因名稱。如果沒有,你需要根據需要調整它們,使它們排列起來。

from Bio import SeqIO 

with open('ortholog1.txt','r') as f: 
    orthologs_txt = f.read() 
orthologs = orthologs_txt.splitlines() 

genes_to_keep = [] 
for record in SeqIO.parse(open('all.fasta','r'), 'fasta'): 
    if record.description in orthologs: 
     genes_to_keep.append(record) 

with open('ortholog1.fasta','w') as f: 
    SeqIO.write(genes_to_keep, f, 'fasta') 

編輯:這是保持在同一順序輸出基因作爲同源物文件的一種方法:

from Bio import SeqIO 

with open('all.fasta','r') as fasta_file: 
    record_dict = SeqIO.to_dict(open(SeqIO.parse(fasta_file, 'fasta') 

with open('ortholog1.txt','r') as text_file: 
    orthologs_txt = text_file.read() 

genes_to_keep = []  
for ortholog in orthologs_txt.splitlines(): 
    try: 
     genes_to_keep.append(record_dict[ortholog]) 
    except KeyError: 
     pass 

with open('ortholog1.fasta','w') as output_file: 
    SeqIO.write(genes_to_keep, output_file, 'fasta') 
+0

有什麼辦法進行排序,根據在ortholog1基因相同的順序輸出的fasta文件基因順序。文本? –

+0

讓他們保持秩序的一種方法是將記錄讀入字典中,然後逐句通過orthologs。查看更新後的答案 – iayork

1

我不是一個biopython專家,所以我會離開的細節輸出給你,但數據流可以很簡單的(代碼註釋)

# Initially we have a dictionary where the keys are gene names and the 
# values are all empty lists 
genes = {gene.strip():[] for gene in open('ortholog1.txt','r')} 

# parse the data 
for record in SeqIO.parse(open('all.fasta','r'), "fasta"): 
    # append the current record if its id is in the keys of genes 
    if record.id in genes: 
     genes[record.id].append(record) 

with open('ortholog1.fasta','w') as fout: 
    # we read again, line by line, the file of genes 
    for gene in open('ortholog1.txt','r'): 
     # if the list associated with the current gene is not empty 
     if genes[gene.strip()]: 
      # output the list of records for the current gene using 
      # biopython facilities 
      ... 
相關問題