2016-11-04

Error when encoding as UTF-8

I am trying to fetch text data from a website, but this code raises an error. Please let me know where the error is.

import requests
from bs4 import BeautifulSoup

def getportions(soup):
    for p in soup.find_all("p", {"class": ""}):
        yield p.text


def readpage(address):
    page = requests.get(address)
    soup = BeautifulSoup(page.text, "html.parser")
    output_text = ''
    for s in getportions(soup):
        output_text += s.encode("utf8")
        output_text += "\n"
    print(output_text)
    print("End of article")
    fp = open("content.txt", "w")
    fp.write(output_text)

if __name__ == "__main__":
    readpage("http://yahoo.com")

The error is as follows:

    output_text += s.encode("utf8")
TypeError: Can't convert 'bytes' object to str implicitly

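'.encode' returns a 'bytes' object. What are you trying to do?

@MorganThrapp I am trying to write the content to a file.

Did you perhaps mean 'decode'? And why do you think you need to do anything with 'utf-8' at all?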
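To make the point of these comments concrete, here is a minimal sketch of why the traceback appears in Python 3. The helper `try_concat` and the sample string are hypothetical and only serve the illustration; they are not part of the question's code.

```python
# Python 3 keeps text (str) and encoded data (bytes) strictly apart.
text = "sigma: \u03a3"          # str, holds Unicode code points
encoded = text.encode("utf8")   # bytes, the UTF-8 encoding of that text


def try_concat(a, b):
    """Return a + b, or the exception's class name if the types clash."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


mixed = try_concat("", encoded)       # str + bytes, as in the question
same = try_concat("", text)           # str + str works
round_trip = encoded.decode("utf8")   # decode() turns bytes back into str

print(type(encoded).__name__, mixed, same == text, round_trip == text)
```

So the `.encode("utf8")` call is what turns each paragraph into `bytes`, and appending that to a `str` accumulator is what fails.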

Answers

1

If you are using Python 3, all strings are natively Unicode, and you can specify the encoding when you open a file. Your code would become:

def readpage(address):
    ...
    output_text = ''
    for s in getportions(soup):
        output_text += s
        output_text += "\n"
    print(output_text)
    print("End of article")
    fp = open("content.txt", "w", encoding='utf8')
    fp.write(output_text)
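For anyone who wants to try the idea without hitting the network, here is a small self-contained sketch. The function `save_text`, the sample paragraphs, and the temporary path are all assumptions made for the example; the only part taken from the answer is passing `encoding='utf8'` to `open`. It also uses a `with` block so the file gets closed, which the snippet above does not do.

```python
import os
import tempfile


def save_text(paragraphs, path):
    """Join paragraphs with newlines and write them out as UTF-8 text."""
    output_text = "".join(p + "\n" for p in paragraphs)
    # The with-block closes the file even if write() raises.
    with open(path, "w", encoding="utf8") as fp:
        fp.write(output_text)
    return output_text


# Stand-in for the strings that getportions(soup) would yield.
paragraphs = ["plain ASCII", "Greek capital sigma: \u03a3"]
path = os.path.join(tempfile.mkdtemp(), "content.txt")
written = save_text(paragraphs, path)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf8") as fp:
    read_back = fp.read()
```

Note that no `.encode()` call is needed anywhere: the file object does the encoding when it writes.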

If you just want to sanitize the text by replacing every non-ASCII character with ?, open the file like this:

fp = open("content.txt", "w", encoding='ascii', errors='replace') 
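A quick sketch of what that does to the written data, assuming a throwaway temporary file and a made-up sample string:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "content.txt")

# errors='replace' substitutes "?" for anything the ASCII codec cannot encode.
with open(path, "w", encoding="ascii", errors="replace") as fp:
    fp.write("sigma: \u03a3\n")

with open(path, encoding="ascii") as fp:
    sanitized = fp.read()
```

The replacement is lossy: the original character cannot be recovered from the file afterwards.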
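It shows an error again: return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u03a3' in position 350: character maps to <undefined>

@NARAYANCHANGDER: Cannot reproduce. Show the code that produces the error, along with the stack trace. UTF-8 is designed to be able to encode any Unicode character...

Thanks, it is working for other web pages.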
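On the 'charmap' error from the comments: the thread never shows which line raised it, so the following is only a guess at the mechanism. A 'charmap' traceback typically comes from a legacy single-byte code page rather than from UTF-8, for example when `print` writes to a console that uses such a code page, or when a file is opened without an `encoding` argument. The sketch below assumes cp1252 as that code page purely for illustration.

```python
sigma = "\u03a3"  # the character named in the traceback

# Assumption: the failing codec was a legacy code page such as cp1252.
try:
    sigma.encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# UTF-8 can encode every Unicode character, so this always succeeds.
utf8_bytes = sigma.encode("utf8")

print(failed, utf8_bytes)
```

If that is what happened, it would also explain why pages without such characters work fine.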