Python: UnicodeDecodeError: 'utf-8' codec can't decode byte ... invalid continuation byte

I am building a web scraper in Python 3.3 using BeautifulSoup, but I have run into a problem that prevents me from getting a valid string that I can use with BeautifulSoup. The error is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 7047: invalid continuation byte

I know there are dozens of similar questions, but so far I have not found a method that helps me diagnose what is wrong with the following code:

import urllib.request 
URL = "<url>" # sorry, I cannot show the url for privacy reasons, but it's a normal html document 
page = urllib.request.urlopen(URL) 
page = page.read().decode("utf-8") # from bytes to <source encodings>

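As I suspected, this error only occurs with some URLs and not with others. Even for the URL that fails now, I was not getting the error until yesterday. Then today I ran the program again and the error popped up.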

Any clue on how to diagnose the error?


2014-10-28 dragonmnl

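You should not decode the response yourself. First of all, you are incorrectly assuming that the response is UTF-8 encoded (it is not, as the error shows), but more importantly, BeautifulSoup will detect the encoding for you. See the Encodings section of the BeautifulSoup documentation.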
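To see why the traceback points to a non-UTF-8 response, here is a minimal sketch that reproduces the same kind of error offline and shows the bytes around the failing position. The sample text and the helper name `show_context` are assumptions made for illustration; the real page may use a different encoding than Latin-1.

```python
def show_context(raw, encoding="utf-8", width=10):
    """Return (reason, position, surrounding bytes) if decoding fails, else None.

    Hypothetical helper, not part of any library.
    """
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        lo = max(exc.start - width, 0)
        return exc.reason, exc.start, raw[lo:exc.end + width]
    return None


# Assumed sample: Latin-1 encodes "à" as the single byte 0xe0.
# In UTF-8, 0xe0 starts a three-byte sequence, so the following
# space (0x20) is not a valid continuation byte.
raw = "voilà la page".encode("latin-1")
result = show_context(raw)
print(result)

# The same bytes decode cleanly once the right codec is used.
print(raw.decode("latin-1"))
```

Running `show_context(page.read())` on the failing response would show which bytes sit at the reported position, which is usually enough to guess the real encoding.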

Pass a byte string to BeautifulSoup and it will use any <meta> header declaring the correct encoding, or otherwise autodetect the encoding for you.

In the event that autodetection fails, you can always fall back to the encoding supplied by the server:

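from bs4 import BeautifulSoup

# `page` is the response object returned by urllib.request.urlopen(URL)
# get_content_charset() reads the charset parameter of the Content-Type header (None if absent)
encoding = page.info().get_content_charset()
page = page.read()  # raw bytes; do not decode them yourself
soup = BeautifulSoup(page)
if encoding is not None and soup.original_encoding != encoding:
    print('Server and BeautifulSoup disagree')
    print('Content-type states it is {}, BS4 thinks it is {}'.format(encoding, soup.original_encoding))
    print('Forcing encoding to server-supplied codec')
    soup = BeautifulSoup(page, from_encoding=encoding)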

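This still leaves the actual decoding to BeautifulSoup, but if the server included a charset parameter in the Content-Type header, the code above assumes that the server is configured correctly and forces BeautifulSoup to use that encoding.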
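For cases where you need the text without BeautifulSoup, the same "trust the server's charset" idea can be sketched with the standard library alone. The function names, the sample header values, and the UTF-8 fallback below are assumptions for illustration; only `email.message.Message.get_content_charset()` and `bytes.decode()` are real library calls.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(raw, content_type, fallback="utf-8"):
    """Decode raw bytes with the declared charset, falling back leniently.

    Hypothetical helper: undecodable bytes become U+FFFD in the fallback path
    instead of raising UnicodeDecodeError.
    """
    charset = charset_from_content_type(content_type) or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong declaration: degrade gracefully.
        return raw.decode(fallback, errors="replace")


body = "voilà".encode("latin-1")
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(decode_body(body, "text/html; charset=ISO-8859-1"))
print(decode_body(body, "text/html"))  # no charset declared, lenient UTF-8
```

With a live response, the header value would come from `page.info()["Content-Type"]`. The lenient fallback trades correctness for robustness, so it suits logging or inspection better than data you intend to store.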


2014-10-28 15:44:50
