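I'm working on a project where I need to parse a website with Beautiful Soup. The site is http://www.manta.com, but when I look for the site's encoding in the meta tags of the HTML, nothing is shown. I tried parsing the HTML locally, from a downloaded copy of the page, but I'm running into decoding errors: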
from bs4 import BeautifulSoup

# Manta web page downloaded earlier
html = open('1.html', 'r')
soup = BeautifulSoup(html, 'lxml')
This produces the following stack trace:
Traceback (most recent call last):
File "E:/Projects/Python/webkit/sample.py", line 10, in <module>
soup = BeautifulSoup(html, 'lxml')
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
self._feed()
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
self.parser.close()
File "parser.pxi", line 1209, in lxml.etree._FeedParser.close(src\lxm\lxml.etree.c:90717)
File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:100104)
File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99927)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9387)
File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:96065)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 105-106: invalid data
I also tried passing an encoding to the Beautiful Soup constructor:
soup = BeautifulSoup(html, 'lxml', from_encoding="some encoding")
But I keep getting the same error.
Interestingly, if I load the page in a browser, change the encoding to UTF-8 in Firefox, and save it, parsing works fine. Any help is much appreciated. Thanks.
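To make that Firefox workaround concrete, here is a minimal sketch of what I think "re-save as UTF-8" amounts to in Python. The helper names decode_html and resave_as_utf8 and the candidate list (utf-8, then cp1252) are my own assumptions for illustration; I still don't know the page's real encoding, so the list may need adjusting.

```python
# -*- coding: utf-8 -*-
import io


def decode_html(raw, candidates=("utf-8", "cp1252")):
    """Decode raw bytes, trying each candidate encoding strictly, in order.

    The candidate list is a guess (an assumption), not the page's
    declared encoding.
    """
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and replace every
    # undecodable byte with U+FFFD instead of raising.
    return raw.decode(candidates[0], "replace")


def resave_as_utf8(src_path, dst_path):
    """Read src_path as bytes, decode it, and write it back as UTF-8."""
    # Binary mode, so the bytes are not altered before decoding.
    with io.open(src_path, "rb") as f:
        text = decode_html(f.read())
    with io.open(dst_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

The text returned by decode_html is already Unicode, so it could presumably be passed straight to BeautifulSoup(text, 'lxml'), in which case from_encoding should not be needed.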
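I can't open www.manta.com. Is that the correct URL? – justhalf
Try this: [link](http://www.manta.com/mb_43_A0_19/advertising_marketing/louisiana?pg=2). This is one of the pages I want to parse. –
I can't open it from here. Is it only available in the US? (By the way, how do you make a hyperlink in a comment?) – justhalf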