2013-09-24 64 views

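Beautiful Soup decoding error

I am working on a project where I need to parse a website with Beautiful Soup. The site is http://www.manta.com, but when I look for the site's encoding in the meta tags of the HTML, nothing is declared. I tried parsing the HTML locally, from a downloaded copy of the page, but I run into decoding errors: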

from bs4 import BeautifulSoup

# manta.com page, downloaded beforehand
html = open('1.html', 'r')
soup = BeautifulSoup(html, 'lxml')

This produces the following stack trace:

Traceback (most recent call last): 
    File "E:/Projects/Python/webkit/sample.py", line 10, in <module> 
    soup = BeautifulSoup(html, 'lxml') 
    File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__ 
    self._feed() 
    File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed 
    self.builder.feed(self.markup) 
    File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed 
    self.parser.close() 
    File "parser.pxi", line 1209, in lxml.etree._FeedParser.close(src\lxm\lxml.etree.c:90717)
    File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:100104)
    File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99927)
    File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9387)
    File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:96065)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 105-106: invalid data 
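
The positions in the final `UnicodeDecodeError` line are byte offsets into the input. As a minimal sketch using only the standard library, the offending span in a saved page could be located like this. The helper name `find_invalid_utf8` and the sample bytes are illustrative and are not taken from the manta.com page:

```python
def find_invalid_utf8(raw):
    """Return (start, end, offending_bytes) for the first invalid UTF-8
    sequence in `raw`, or None if `raw` decodes cleanly."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the undecodable byte span
        return exc.start, exc.end, raw[exc.start:exc.end]
    return None


# Illustrative data: 0xbb is a lone continuation byte, invalid in UTF-8
sample = b'<p>caf\xbb</p>'
print(find_invalid_utf8(sample))
```

For a real file, the bytes would come from something like `open('1.html', 'rb').read()`, so that no decoding happens before the check.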

I tried passing the encoding to the Beautiful Soup constructor:

soup = BeautifulSoup(html, 'lxml', from_encoding="some encoding")

But I keep getting the same error.

Interestingly, if I load the page in a browser, change the encoding to UTF-8 in Firefox, and save it, parsing works fine. Any help is much appreciated. Thanks.
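
Re-saving the page as UTF-8 amounts to decoding the bytes with some source encoding and writing them back out as UTF-8. A rough sketch of doing the same step in Python follows. It assumes the page is really in a single-byte encoding such as Windows-1252, which is a guess, since the page declares none; the function name `decode_with_fallback` is hypothetical:

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252')):
    """Try each candidate encoding in order and return (text, encoding).

    Falls back to latin-1, which maps every byte to a character and
    therefore never fails (but may produce wrong characters)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode('latin-1'), 'latin-1'


# Illustrative bytes: 0xe9 is 'é' in cp1252 but is not valid UTF-8 here
text, used = decode_with_fallback(b'caf\xe9 menu')
utf8_bytes = text.encode('utf-8')  # what a "save as UTF-8" would write
print(used, repr(utf8_bytes))
```

The decoded `text` is already Unicode, so it could be handed to `BeautifulSoup(text, 'lxml')` directly, leaving the parser no encoding to guess.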

+0

I can't open www.manta.com. Is that the correct URL? – justhalf

+0

Try this: [link](http://www.manta.com/mb_43_A0_19/advertising_marketing/louisiana?pg=2). It is one of the pages I want to parse. –

+0

I can't open it from here. Is it only available in the US? (By the way, how do you make a hyperlink in a comment?) – justhalf

Answers

1

Encode it as UTF-8:

soup = BeautifulSoup(html.encode('UTF-8'),'lxml') 
+0

I tried that, and I get another encoding error during the encode step. –

+0

What error did you see? – justhalf

+0

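When I put all the lines of the file into one string and try to run _ht = _ht.encode('utf-8'), I get: Traceback (most recent call last): File "E:/Projects/Python/webkit/sample.py", line 15, in <module> _ht = _ht.encode('utf-8') UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 152380: unexpected code byte –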
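
An `encode` call that raises a *decode* error usually means the value is already a byte string, so Python 2 has to decode it implicitly before it can encode it. One way to sidestep that, sketched here under assumptions, is to make the decode step explicit. The source encoding `cp1252` is a guess rather than something the page declares, and `to_utf8` is a hypothetical helper:

```python
def to_utf8(raw, source_encoding='cp1252'):
    """Decode `raw` bytes explicitly, then re-encode them as UTF-8.

    errors='replace' substitutes U+FFFD for undecodable bytes instead
    of raising, so damaged input is visible rather than fatal."""
    text = raw.decode(source_encoding, 'replace')
    return text.encode('utf-8')


# Illustrative input: 0xbb is the character U+00BB in cp1252
print(repr(to_utf8(b'next \xbb')))
```

If the guessed source encoding is wrong, the output will still be valid UTF-8 but may contain the wrong characters, so it is worth spot-checking a few non-ASCII passages.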