What should I do if a UTF-8 encoded HTML file contains non-UTF-8 characters?

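I'm trying to use BeautifulSoup to parse an HTML file that is encoded in UTF-8. Unfortunately, the file contains a few characters that are not valid UTF-8, so they can't be displayed correctly. That's fine with me, because I can simply skip those characters.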

The problem is that even when I explicitly pass fromEncoding as UTF-8:

soup = BeautifulSoup(html, fromEncoding='utf-8')

it turns out that soup.originalEncoding is set to the automatic fallback, windows-1252.

print soup.originalEncoding 
windows-1252

I checked the BeautifulSoup documentation, which says:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode: 

- An encoding you pass in as the fromEncoding argument to the soup 
    constructor. 
- An encoding discovered in the document itself 
- An encoding sniffed by looking at the first few bytes of the file. If 
    an encoding is detected at this stage, it will be one of the UTF-* 
    encodings, EBCDIC, or ASCII. 
- An encoding sniffed by the chardet library, if you have it installed. 
- UTF-8 
- Windows-1252

So it seems it should use the fromEncoding I specified, rather than falling through to the last entry in the list.

Here is the original html I'm parsing, for your reference.


2012-01-18 Feng Li

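@joelgoldstick, I would say the intended encoding is utf-8 (judging from the head section of the HTML). But the file may contain some characters that are not valid utf-8 (most likely they are windows-1252). That could be the cause. Still, I would rather keep only the utf-8 parts and drop the windows-1252 parts. – 2012-01-18 16:14:09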

If you know what the file's encoding is, try decoding the string yourself before passing it to BeautifulSoup, and explicitly ignore the non-UTF-8 characters.

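# Decode first, dropping any bytes that are not valid UTF-8
unicode_html = myfile.read().decode('utf-8', 'ignore')
# BeautifulSoup now receives Unicode text, so no encoding detection is needed
soup = BeautifulSoup(unicode_html)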
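
For readers on Python 3, the same idea can be sketched with the standard library alone. The sample bytes below are made up for illustration, and the BeautifulSoup call is left out so that the snippet runs without third-party packages; the decoded string is what you would hand to the parser.

```python
# Hypothetical sample: valid UTF-8 plus one stray windows-1252 byte (0xE9).
raw = "<p>naïve caf".encode("utf-8") + b"\xe9" + "</p>".encode("utf-8")

# errors="ignore" silently drops bytes that are not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" marks each bad sequence with U+FFFD instead of dropping it,
# which makes it easier to see where data was lost.
replaced = raw.decode("utf-8", errors="replace")

print(ignored)
print(replaced)
```

Whether to use "ignore" or "replace" depends on whether you want the damage to stay visible in the output.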


2012-01-18 16:29:23

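The page you referenced appears to be mostly UTF-8 encoded, but it contains some byte sequences that cannot occur in UTF-8 encoded data. They were probably introduced by a faulty transcoding step or by inserting data in a different encoding. They only affect "content" data, though.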

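UTF-8 is "self-synchronizing", so if you simply skip the bad bytes, the rest should be fine: by the time you reach the HTML markup, everything is in the ASCII range. Characters that matter for markup always appear as single bytes below 0x80.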
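
A small sketch of the self-synchronizing property, using a made-up fragment rather than the page from the question: one byte is cut out of a multi-byte character, and the decoder still recovers the following characters and the markup.

```python
# Hypothetical fragment: ASCII markup around three 3-byte UTF-8 characters.
good = "<b>日本語</b>".encode("utf-8")

# Every byte of a multi-byte character is >= 0x80, so it can never be
# mistaken for ASCII markup such as '<' or '>'.
assert all(byte >= 0x80 for byte in "日本語".encode("utf-8"))

# Corrupt the data by removing the last byte of the first character.
broken = good[:5] + good[6:]

# Skipping the bad bytes loses only the damaged character; decoding
# resynchronizes at the next lead byte and the tags survive.
text = broken.decode("utf-8", errors="ignore")
print(text)
```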


2012-01-18 16:05:45

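I agree, but I couldn't find a solution using BeautifulSoup. Whatever encoding you specify (fromEncoding), it always ends up at windows-1252, because some byte sequences cannot occur in UTF-8 encoded data. – 2012-01-18 16:18:24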
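
The fall-through described here can be illustrated with a rough sketch. This is not BeautifulSoup's actual code; it only assumes that a candidate encoding is rejected when a strict decode fails, and the helper name `first_encoding_that_fits` and the sample bytes are hypothetical.

```python
def first_encoding_that_fits(data, candidates):
    """Return (encoding, text) for the first candidate that decodes strictly."""
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)  # strict error handling by default
        except UnicodeDecodeError:
            continue  # these bytes are invalid in this encoding; try the next one
    return None, None


# Hypothetical sample: UTF-8 text plus one windows-1252 byte (0x92, a curly apostrophe).
data = "<p>é it".encode("utf-8") + b"\x92s</p>"

encoding, text = first_encoding_that_fits(data, ["utf-8", "windows-1252"])
print(encoding)
print(text)
```

Under that assumption, a single invalid byte is enough to reject UTF-8 for the whole document, and the windows-1252 result then garbles every genuine multi-byte UTF-8 character. That is why decoding with an error handler before parsing tends to be the more useful route.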
