Python 3: UnicodeDecodeError when fetching abstracts with Biopython's efetch
I recently used Biopython to fetch some abstracts from PubMed. My code is below:
from Bio import Entrez

Entrez.email = "[email protected]" # Always tell NCBI who you are

def get_number(): #Get the total number of abstract available in Pubmed
    handle = Entrez.egquery(term="allergic contact dermatitis ")
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"]=="pubmed":
            return int(row["Count"])

def get_id(): #Get all the ID of the abstract available in Pubmed
    handle = Entrez.esearch(db="pubmed", term="allergic contact dermatitis ", retmax=200)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    return idlist

idlist = get_id()
for ids in idlist: #Download the abstract based on their ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text") # Retmode Can Be txt/json/xml/csv
    f = open("{}.txt".format(ids), "w") # Create a TXT file with the name of ID
    f.write(handle.read()) #Write the abstract to the TXT file
I want to download every abstract, but the script only succeeds for the first three or four. Then an error occurs:
UnicodeDecodeError: 'cp950' codec can't decode byte 0xc5 in position 288: illegal multibyte sequence
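The error is raised at handle.read(). It looks like the problem lies with abstracts that contain certain symbols or words. I tried using print to find out the class of handle: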
handle = Entrez.efetch(db="pubmed", id=idlist, rettype="abstract", retmode="text")
print(handle)
The result is:
<_io.TextIOWrapper encoding='cp950'>
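That cp950 is presumably the locale default on my Windows machine rather than the encoding of the data itself. Here is a minimal offline sketch of that kind of mismatch, assuming the response bytes are UTF-8 (the sample string and the helper name try_decode are made up for illustration, and nothing here contacts NCBI):

```python
def try_decode(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, returning the text or a short failure description."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: {}".format(exc.reason)

# Illustrative sample, not a real abstract: "ł" is the bytes 0xC5 0x82 in UTF-8.
raw = "Michał".encode("utf-8")

print(try_decode(raw, "utf-8"))  # decodes cleanly
print(try_decode(raw, "cp950"))  # 0xC5 starts a two-byte cp950 sequence that 0x82 cannot complete
```

If this is what happens, only abstracts containing such non-ASCII characters would trigger the error, which would match the first few downloads succeeding.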
I have looked through many web pages for a solution, but none of them worked. Can anyone help?
See https://github.com/biopython/biopython/issues/1402 – peterjc
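One possible workaround, sketched under assumptions: the handle is an io.TextIOWrapper as printed in the question, the bytes NCBI returns are UTF-8, and nothing has been read from the handle yet. The helper read_utf8 and the stand-in handle below are hypothetical and not part of Biopython; whether this is needed at all likely depends on the Biopython version in use.

```python
import io
import os
import tempfile

def read_utf8(handle):
    """Return the handle's full content decoded as UTF-8.

    Assumes `handle` exposes its underlying binary stream as `.buffer`
    (true for io.TextIOWrapper) and has not been read from yet.
    """
    return handle.buffer.read().decode("utf-8")

# Stand-in for the efetch handle: UTF-8 bytes behind a cp950 text wrapper.
fake_handle = io.TextIOWrapper(io.BytesIO("Michał".encode("utf-8")), encoding="cp950")
text = read_utf8(fake_handle)

# Write with an explicit encoding so the platform default is not used for output either.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        written = f.read()
```

In the loop from the question, this would mean replacing handle.read() with read_utf8(handle) and passing encoding="utf-8" to open().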