BeautifulSoup character encoding error

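I am using BeautifulSoup to scrape information from websites. Specifically, I want to collect information about patents from Google search (title, inventors, abstract, etc.). I have a list of URLs, one per patent, but BeautifulSoup has trouble with some of the pages and gives me the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte

Here is the error traceback: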

Traceback (most recent call last): 
    soup = BeautifulSoup(the_page,from_encoding='utf-8') 
    File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__ 
    self._feed() 
    File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed 
    self.builder.feed(self.markup) 
    File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed 
    self.parser.close() 
    File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxml\lxml.etree.c:90597) 
    File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99984) 
    File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99807) 
    File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9383) 
    File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:95945) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte

I checked the site's encoding, and it claims to be 'utf-8'. I also specified that encoding as the input to BeautifulSoup. Here is my code:

import urllib, urllib2 
from bs4 import BeautifulSoup 

#url = 'https://www.google.com/patents/WO2001019016A1?cl=en' # This one works 
url = 'https://www.google.com/patents/WO2006016929A2?cl=en' # This one doesn't work 

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' 
values = {'name' : 'Somebody', 
      'location' : 'Somewhere', 
      'language' : 'Python' } 
headers = { 'User-Agent' : user_agent } 

data = urllib.urlencode(values) 
req = urllib2.Request(url, data, headers) 
response = urllib2.urlopen(req) 
the_page = response.read() 

print response.headers['content-type'] 
print response.headers.getencoding() 

soup = BeautifulSoup(the_page,from_encoding='utf-8')

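I have included two URLs. One causes the error and the other works fine (marked as such in the comments). In both cases I can print the HTML to the terminal, but BeautifulSoup crashes every time on the failing URL.

Any suggestions? This is my first time using BeautifulSoup.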

Source

2013-07-11 user1911297

I am using Python 2.7 and BeautifulSoup4 on Windows – user1911297

You should encode the string as UTF-8:

soup = BeautifulSoup(the_page.encode('UTF-8'))
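As an alternative to the one-liner above, here is a minimal sketch (not the answerer's method) that decodes the raw bytes yourself and tolerates invalid ones. Under Python 2.7, `response.read()` returns a byte string, so decoding it to text is the more direct operation. The helper name `decode_page` and the `sample` bytes are hypothetical stand-ins for illustration. The final BeautifulSoup call is left commented out so that the snippet runs with the standard library alone.

```python
from __future__ import print_function


def decode_page(raw_bytes, encoding="utf-8"):
    """Decode raw HTML bytes, tolerating bytes that are invalid in `encoding`.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    try:
        # Strict decoding: raises UnicodeDecodeError on the first bad byte.
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        bad = raw_bytes[exc.start:exc.start + 1]
        print("invalid byte %r at position %d" % (bad, exc.start))
        # Fall back: substitute U+FFFD for each undecodable sequence.
        return raw_bytes.decode(encoding, "replace")


# Stand-in for `the_page`: mostly valid UTF-8 with one stray 0xCC byte,
# which is a lead byte that is not followed by a continuation byte.
sample = b"<html><body>caf\xc3\xa9 \xcc bad</body></html>"
text = decode_page(sample)

# With bs4 and lxml installed, the decoded text can then be parsed:
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(text, "lxml")
```

Because BeautifulSoup receives already-decoded text here, it does not need to guess or trust the encoding declared by the page. The trade-off is that each invalid byte sequence shows up as a replacement character in the parsed output.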

Source

2013-09-24 06:56:15 justhalf
