BeautifulSoup fails to parse YouTube pages
I am trying to do some simple web scraping with Python's BeautifulSoup library, and I get a UnicodeDecodeError when parsing most YouTube pages.
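It looks like YouTube is serving invalid characters in its HTML. That is their problem, of course, but I thought the whole point of BeautifulSoup was that it copes with malformed pages and makes its best guess at the result. I would be happy if it simply dropped the invalid characters. I am far from a Unicode expert, and the various magic incantations of encode and decode that I have tried have done me no good.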
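For reference, "dropping or replacing the invalid characters" can be sketched with the standard library alone. This is only an illustration under stated assumptions: decode_leniently is a hypothetical helper name, the sample bytes are made up, and the page is assumed to be mostly UTF-8. It is not a confirmed fix for the YouTube page.

```python
def decode_leniently(raw, encoding="utf-8"):
    """Decode bytes, turning each undecodable sequence into U+FFFD.

    decode_leniently is a hypothetical helper, not part of bs4 or lxml.
    """
    return raw.decode(encoding, "replace")


# Made-up sample: valid UTF-8 (an e-acute, bytes c3 a9) plus a stray
# 0xd7 lead byte followed by a space instead of a continuation byte.
sample = b"<title>caf\xc3\xa9 \xd7 bar</title>"
text = decode_leniently(sample)
print(repr(text))
```

Passing "ignore" instead of "replace" would drop the bad bytes rather than mark them. The result is already Unicode text, so it could be passed to BeautifulSoup in place of the raw bytes, though whether that avoids the error on a given page would still need testing.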
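Does anyone have suggestions for handling this error? I don't want to make my code YouTube-specific, because it needs to handle a large number of user-specified web pages.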
Here is a very simple code snippet that demonstrates the problem:
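import urllib  # Python 2.7; urllib.urlopen does not exist in Python 3
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=W9MzrirPrCI'
text = urllib.urlopen(url).read()  # raw, undecoded bytes of the page
soup = BeautifulSoup(text)  # no parser named, so bs4 chooses one itself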
The last line raises the following error:
UnicodeDecodeError Traceback (most recent call last)
/cygdrive/d/home/ll-virtualenv/lib/python2.7/site-packages/Django-1.5.1-py2.7.egg/django/core/management/commands/shell.pyc in <module>()
----> 1 soup = BeautifulSoup(text)
/cygdrive/d/home/ll-virtualenv/lib/python2.7/site-packages/bs4/__init__.pyc in __init__(self, markup, features, builder, parse_only, from_encoding, **kwargs)
170
171 try:
--> 172 self._feed()
173 except StopParsing:
174 pass
/cygdrive/d/home/ll-virtualenv/lib/python2.7/site-packages/bs4/__init__.pyc in _feed(self)
183 self.builder.reset()
184
--> 185 self.builder.feed(self.markup)
186 # Close out any unfinished strings and close all the open tags.
187 self.endData()
/cygdrive/d/home/ll-virtualenv/lib/python2.7/site-packages/bs4/builder/_lxml.pyc in feed(self, markup)
193 def feed(self, markup):
194 self.parser.feed(markup)
--> 195 self.parser.close()
196
197 def test_fragment_to_document(self, fragment):
/usr/lib/python2.7/site-packages/lxml-3.1.0-py2.7-cygwin-1.7.17-i686.egg/lxml/etree.dll in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:88786)()
/usr/lib/python2.7/site-packages/lxml-3.1.0-py2.7-cygwin-1.7.17-i686.egg/lxml/etree.dll in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:98085)()
/usr/lib/python2.7/site-packages/lxml-3.1.0-py2.7-cygwin-1.7.17-i686.egg/lxml/etree.dll in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97909)()
/usr/lib/python2.7/site-packages/lxml-3.1.0-py2.7-cygwin-1.7.17-i686.egg/lxml/etree.dll in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:9071)()
/usr/lib/python2.7/site-packages/lxml-3.1.0-py2.7-cygwin-1.7.17-i686.egg/lxml/etree.dll in lxml.etree._handleSaxData (src/lxml/lxml.etree.c:94081)()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd7 in position 22: invalid continuation byte
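To make the final message concrete, here is a small sketch that reproduces the same kind of failure without any network access. The byte string is made up so that a 0xd7 byte lands at position 22; it is not the actual YouTube response, and it says nothing about why the real page contains that byte.

```python
# 22 filler bytes, then 0xd7 followed by a space. In UTF-8, 0xd7 starts a
# two-byte sequence, so the next byte would have to be in 0x80-0xbf.
raw = b"x" * 22 + b"\xd7 rest"

try:
    raw.decode("utf-8")
    err = None
except UnicodeDecodeError as exc:
    err = exc

# The exception records where decoding stopped and why.
print(err.start, err.reason)

# Latin-1 maps every byte to a code point, so the same byte decodes there.
as_latin1 = b"\xd7".decode("latin-1")
print(repr(as_latin1))
```

The start and reason attributes of the exception match the "position 22: invalid continuation byte" part of the message above, which can help when checking which byte of a downloaded page is the offender.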
Try using scrapy instead. – 2013-05-03 16:46:05
I am using version 4.1.3 and it works fine. – Moj 2013-05-03 17:05:29
If I go back to version 3 of BeautifulSoup, it works. With 4.1.3 I still get the error above. Moj, are you using the same URL as I am? – 2013-05-04 13:47:26