Python decoding errors with BeautifulSoup, requests, and lxml

I'm trying to pull some data from a popular browser-based game, but I'm running into decoding errors:

import requests 
from bs4 import BeautifulSoup 

r = requests.get("http://www.neopets.com/") 
p = BeautifulSoup(r.text)

This produces the following stack trace:

Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 172, in __init__ 

File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 185, in _feed 

File "build/bdist.linux-x86_64/egg/bs4/builder/_lxml.py", line 195, in feed 
File "parser.pxi", line 1187, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:87912) 
File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97055) 
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8862) 
File "saxparser.pxi", line 274, in lxml.etree._handleSaxCData (src/lxml/lxml.etree.c:93385) 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 476: invalid start byte

Running the following:

print repr(r.text[476 - 10: 476 + 10])

produces:

u'ttp-equiv="X-UA-Comp'

I really don't know what the problem is here. Any help is greatly appreciated. Thank you.


2012-10-18 Joshua Gilman

Have you tried using 'r.content'? BeautifulSoup decodes for you, but 'r.text' returns Unicode.

See the comment below. That appears to fail as well.

.text on a response returns a decoded Unicode value, but perhaps you should let BeautifulSoup do the decoding for you:

p = BeautifulSoup(r.content, from_encoding=r.encoding)

r.content returns the raw, undecoded bytestring, and r.encoding is the encoding detected from the response headers.
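To make the bytes-versus-text distinction concrete without touching the network, here is a minimal sketch using only the standard library. The byte string and the helper name decode_body are made up for illustration: they stand in for r.content and for whatever decoding step a parser performs, and are not part of requests or BeautifulSoup. It assumes the page is really ISO-8859-1, which may not be true for the site in the question.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for r.content: raw ISO-8859-1 bytes, where 0xb1
# is the plus-minus sign. A lone 0xb1 cannot start a UTF-8 sequence.
raw = b"<p>5 \xb1 2</p>"


def decode_body(raw_bytes, declared_encoding, fallback="iso-8859-1"):
    """Decode with the declared encoding, falling back if that fails.

    Illustrative helper only; not a requests or BeautifulSoup API.
    """
    encoding = declared_encoding or fallback  # the declared value may be None
    try:
        return raw_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw_bytes.decode(fallback)


error_reason = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # Same class of failure as in the traceback from the question.
    error_reason = exc.reason

text = decode_body(raw, "utf-8")
print(error_reason)
print(repr(text))
```

The point of the sketch is only that the same bytes either fail or succeed depending on which codec is applied, so it matters whether the declared encoding matches what the server actually sent.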


2012-10-18 18:04:15

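>>> p = BeautifulSoup(r.content) Traceback (most recent call last): ... UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

@user1622821: I can't reproduce your problem with that URL, by the way. Your example works for me, and so does mine.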

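Could this have something to do with my local installation? I'm running Python 2.6.6 on CrunchBang, with the latest versions of libxml and libxslt installed. I installed BeautifulSoup4, requests, and lxml using easy_install.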
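When a problem reproduces on one machine but not another, comparing installed versions is a reasonable first step. The sketch below is one way to collect them; the function name installed_versions is hypothetical, and it assumes each package exposes a __version__ attribute, reporting "unknown" when it does not.

```python
import sys


def installed_versions(module_names=("bs4", "lxml", "requests")):
    """Map each module name to its version string, or None if not importable."""
    versions = {"python": sys.version.split()[0]}
    for name in module_names:
        try:
            module = __import__(name)
        except ImportError:
            versions[name] = None
            continue
        # Not every package defines __version__, so fall back to a marker.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in sorted(installed_versions().items()):
    print("%s: %s" % (name, version or "not installed"))
```

Posting this output from both machines alongside the traceback makes it easier to tell whether a particular parser build is involved.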
