How do I use Entrez.efetch to get a specific protein sequence?

I am trying to fetch protein sequences from NCBI by GI number, using Biopython's Entrez.efetch() function:

proteina = Entrez.efetch(db="protein", id=gi, rettype="gb", retmode="xml")

I then read the data using:

proteinaXML = Entrez.read(proteina)

I can print the result, but I don't know how to isolate the protein sequence.

Once the result is displayed I can locate the protein sequence by hand, or I can walk the XML tree using:

proteinaXML[0]["GBSeq_feature-table"][2]["GBFeature_quals"][6]['GBQualifier_value']

However, depending on the GI of the submitted protein, the XML tree can differ, which makes it hard to automate this process robustly.

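My question: is it possible to retrieve only the protein sequence, rather than the entire XML tree? Alternatively, given that the structure of the XML can differ from protein to protein, how can I extract the protein sequence from it?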

Thanks


2013-11-14 daniel

Good point: database entries in XML do vary between proteins submitted by different authors.
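If only the sequence is needed, one option is to skip the XML tree and ask efetch for FASTA text instead. The following is a minimal sketch under stated assumptions: `fasta_to_sequence` and `fetch_protein_sequence` are hypothetical helper names, the e-mail address in the usage note is a placeholder, and it assumes that efetch accepts rettype="fasta" with retmode="text" for the protein database and returns a single record.

```python
def fasta_to_sequence(fasta_text):
    """Return the residues of a single-record FASTA string as one line."""
    lines = [line.strip() for line in fasta_text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected FASTA text starting with a '>' header line")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("expected exactly one FASTA record")
    # Everything after the header line is sequence, possibly wrapped
    return "".join(lines[1:])


def fetch_protein_sequence(gi, email):
    """Fetch one protein record from NCBI as FASTA and return its sequence.

    Hypothetical helper: needs Biopython and network access.
    """
    from Bio import Entrez  # imported here so the parser above works offline

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(db="protein", id=gi, rettype="fasta", retmode="text")
    try:
        text = handle.read()
    finally:
        handle.close()
    if isinstance(text, bytes):  # defensive: decode if the handle yields bytes
        text = text.decode("utf-8")
    return fasta_to_sequence(text)
```

Calling `fetch_protein_sequence("1293613", "you@example.com")` would then return the sequence as a plain string. With Biopython available, `SeqIO.read(handle, "fasta")` is an alternative to the hand-written parser and gives a SeqRecord whose `.seq` holds the residues.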

I wrote an algorithm to "hunt" for protein sequences in the XML tree:

from Bio import Entrez

gi = '1293613'                    # example GI number
Entrez.email = "you@example.com"  # placeholder; always tell NCBI who you are
protina = Entrez.efetch(db="protein", id=gi, retmode="xml")  # fetch the XML
protinaXML = Entrez.read(protina)[0]

seqs = []  # store candidate protein seqs

def seqScan(xml):  # recursively collect protein seqs
    # Entrez.read returns ListElement, DictionaryElement and StringElement
    # objects, which subclass list, dict and str respectively
    if isinstance(xml, list):
        for ele in xml:
            seqScan(ele)
    elif isinstance(xml, dict):
        for key in xml.keys():
            seqScan(xml[key])
    elif isinstance(xml, str):
        # v___THIS IS THE KEYWORD FILTER_____v
        # 1) most protein sequences start with M (methionine)
        # 2) the length check drops short strings such as author names starting with M
        if xml.startswith('M') and len(xml) > 10:
            seqs.append(xml)

seqScan(protinaXML)  # run the recursive sequence collection
print(seqs)          # print the goods!

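Note: in rare cases (depending on the "keyword filter") it can humorously grab unwanted strings, such as an author name that starts with "M" and whose abbreviated form is longer than 10 characters.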
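One way to avoid those false positives is to filter on the key name instead of the string content. The sketch below is an illustration, not a definitive fix: `find_values` is a hypothetical helper, the record is mock data standing in for the output of Entrez.read, and it assumes the record was fetched as GenBank XML (rettype="gb", as in the question), where the sequence is expected under a "GBSeq_sequence" key.

```python
def find_values(node, wanted_key):
    """Yield every value stored under wanted_key in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == wanted_key:
                yield value
            else:
                yield from find_values(value, wanted_key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, wanted_key)
    # strings and other leaf values carry no keys, so they are skipped


# Mock record shaped like parsed GenBank XML (assumed layout, not real data)
record = {
    "GBSeq_locus": "DEMO",
    "GBSeq_references": [
        {"GBReference_authors": ["Mustermann,M.A.B.C."]},  # long name starting with M
    ],
    "GBSeq_sequence": "mkvlaggtr",
}

print(list(find_values(record, "GBSeq_sequence")))  # prints ['mkvlaggtr']
```

Because the match is on the key, the long author name is never considered, and the search still works if the sequence sits at a different depth in another record.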

Hope that helps!


2013-11-25 17:52:11
