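I'm trying to parse a UTF-8 encoded HTML file with BeautifulSoup. Unfortunately, the file contains a few bytes that are not valid UTF-8, so those characters cannot be displayed correctly. That's fine with me, because I can simply skip them. What should I do when a UTF-8 encoded HTML file contains non-UTF-8 characters?
The problem is that even when I explicitly pass fromEncoding as UTF-8: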
soup = BeautifulSoup(html, fromEncoding='utf-8')
soup.originalEncoding still ends up as Windows-1252, the automatic last-resort default:
print soup.originalEncoding
windows-1252
I checked the BeautifulSoup documentation, which says:
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
- An encoding you pass in as the fromEncoding argument to the soup
constructor.
- An encoding discovered in the document itself
- An encoding sniffed by looking at the first few bytes of the file. If
an encoding is detected at this stage, it will be one of the UTF-*
encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
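So it looks like it should use the fromEncoding I specified, rather than falling through to the last entry in the list.
Here is the original html I'm parsing, for your reference.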
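One possible explanation is that each candidate in the list is tried strictly, so a single invalid byte is enough to reject UTF-8 and move on to the next entry. The sketch below only illustrates that idea with the standard library; it is not BeautifulSoup's actual code, and `first_working_encoding` and the sample bytes are hypothetical names and data made up for the illustration.

```python
def first_working_encoding(data, candidates):
    """Return (text, encoding) for the first candidate that decodes strictly."""
    for encoding in candidates:
        try:
            # Strict decoding: any invalid byte raises UnicodeDecodeError.
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# Hypothetical sample: a valid UTF-8 "é" (0xC3 0xA9) followed by a stray
# 0xE9 byte, which is "é" in windows-1252 but invalid as UTF-8 here.
sample = b"<p>caf\xc3\xa9 caf\xe9</p>"

text, used = first_working_encoding(sample, ["utf-8", "windows-1252"])
print(used)
```

Under that assumption, the stray byte makes the UTF-8 attempt fail, and windows-1252 wins because it accepts almost any byte sequence, at the cost of mangling the characters that really were UTF-8.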
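@joelgoldstick, I'd say the intended encoding is utf-8 (according to the head section of the html). But it's possible that the file contains some characters that are not valid utf-8 (most likely windows-1252 instead). That may be the cause. Still, I would rather keep just the utf-8 part and drop the windows-1252 part. – 2012-01-18 16:14:09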
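A minimal sketch of keeping only the utf-8 part, assuming the file was read as raw bytes: decode it yourself with a lenient error handler before handing the text to the parser. `decode_skipping_invalid` and the sample bytes are hypothetical, introduced only for this illustration; `errors="ignore"` and `errors="replace"` are the standard Python codec error handlers.

```python
def decode_skipping_invalid(data, encoding="utf-8"):
    """Decode bytes, silently dropping any bytes that are invalid in `encoding`."""
    return data.decode(encoding, errors="ignore")


# Hypothetical sample: valid UTF-8 "é" plus a stray windows-1252 "ï" (0xEF).
raw = b"<body>caf\xc3\xa9 na\xefve</body>"

clean = decode_skipping_invalid(raw)            # stray byte is dropped
marked = raw.decode("utf-8", errors="replace")  # stray byte becomes U+FFFD

print(repr(clean))
print(repr(marked))

# The already-decoded text can then be given to the parser, for example
# BeautifulSoup(clean), so that it should not need to guess an encoding.
```

Using `errors="replace"` instead keeps a visible marker where bytes were lost, which can help when checking how much of the file was affected.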