UnicodeDecodeError when fetching abstracts from efetch with Biopython

Recently, I used Biopython to fetch some abstracts from PubMed. My Python 3 code is below:

from Bio import Entrez

Entrez.email = "[email protected]"  # Always tell NCBI who you are


def get_number():  # Get the total number of abstracts available in PubMed
    handle = Entrez.egquery(term="allergic contact dermatitis ")
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            return int(row["Count"])


def get_id():  # Get the IDs of the abstracts available in PubMed
    handle = Entrez.esearch(db="pubmed", term="allergic contact dermatitis ", retmax=200)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    return idlist


idlist = get_id()

for ids in idlist:  # Download each abstract by its ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")  # Retmode can be txt/json/xml/csv
    f = open("{}.txt".format(ids), "w")  # Create a TXT file named after the ID
    f.write(handle.read())  # Write the abstract to the TXT file

I want to download the abstracts, but it only succeeds for three or four of them. Then an error occurs:

UnicodeDecodeError: 'cp950' codec can't decode byte 0xc5 in position 288: illegal multibyte sequence

It looks like handle.read() has a problem with abstracts that contain certain symbols or characters. I tried using print to find out the class of handle:

handle = Entrez.efetch(db="pubmed", id=idlist, rettype="abstract", retmode="text") 
print(handle)

The result is:

<_io.TextIOWrapper encoding='cp950'>
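
For reference, the same kind of failure can be reproduced without any network access. The sketch below is only an illustration: the sample string is made up, and it assumes the server sends UTF-8 bytes while the text wrapper decodes them as cp950. In UTF-8 the letter "ł" is the byte pair 0xC5 0x82, which is not a valid cp950 sequence.

```python
import io

# Hypothetical sample text; "ł" is encoded in UTF-8 as the bytes 0xC5 0x82.
raw = "Kołodziej".encode("utf-8")


def read_with(encoding):
    """Decode the raw bytes through a TextIOWrapper, as a text-mode handle would."""
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding)
    return wrapper.read()


try:
    read_with("cp950")
    cp950_error = None
except UnicodeDecodeError as exc:
    # Same class of error as in the traceback above
    cp950_error = exc

print(cp950_error)
print(read_with("utf-8"))
```

If this assumption holds, the abstracts that fail would be the ones containing non-ASCII characters, which would fit the observation that only some of the downloads break.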

I have searched many web pages for a solution, but none of them worked. Can anyone help?

Source

2017-03-21 K. Yu

See https://github.com/biopython/biopython/issues/1402 – peterjc

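Your code works fine for me. This is an encoding problem on your side. You can open the file in binary mode and encode the text as UTF-8. A workaround you can try is this: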

for ids in idlist:  # Download each abstract by its ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")  # Retmode can be txt/json/xml/csv
    with open("{}.txt".format(ids), "wb") as f:  # Create a TXT file named after the ID, opened in binary mode
        f.write(handle.read().encode("utf-8"))  # Encode the text as UTF-8 before writing
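
Since the traceback reports a decode error, it may also help to bypass the handle's cp950 decoding and decode the raw bytes yourself. The following is a minimal sketch, not a confirmed fix: read_handle_as_utf8 is a hypothetical helper, it assumes the response body is UTF-8, and it assumes the handle is either a binary stream or a text wrapper that exposes the standard .buffer attribute and has not been read from yet. A fake handle stands in for the one returned by Entrez.efetch so the example runs offline.

```python
import io
import os
import tempfile


def read_handle_as_utf8(handle):
    """Return the handle's full content as str, decoding the raw bytes as UTF-8."""
    # A text wrapper exposes its underlying byte stream as .buffer;
    # a binary handle is read directly.
    raw = getattr(handle, "buffer", handle)
    data = raw.read()
    # If the stream already yields str, there is nothing left to decode.
    return data.decode("utf-8") if isinstance(data, bytes) else data


# Stand-in for an efetch handle: UTF-8 bytes wrapped with the wrong codec.
fake_handle = io.TextIOWrapper(
    io.BytesIO("Kołodziej et al.".encode("utf-8")), encoding="cp950"
)
text = read_handle_as_utf8(fake_handle)

# Write with an explicit encoding so the platform default (cp950 here) is not used.
out_path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)
```

In the loop above, this would mean calling read_handle_as_utf8(handle) instead of handle.read(). Whether the real handle exposes .buffer depends on the Biopython version, so treat that part as an assumption to check.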

Source

2017-03-22 14:38:55 scienceisthenewblack
