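Scraping a website with Scrapy: the extracted links contain &amp; and an exception is thrown
When I scrape the site, Scrapy reports: Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong). How can I fix this error?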
I had the same problem with the character → when inserting some links. I found this related commit on GitHub and, following this advice, wrote a file link_extractors.py containing:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
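The part of extract_links that actually avoids the error is the try/except around encode: the text is encoded with the page's declared encoding first, and only falls back to UTF-8 when a character such as → cannot be represented. Here is a minimal standalone sketch of that fallback, independent of Scrapy. The helper name encode_with_fallback and the sample strings are my own, purely for illustration, and latin-1 is just an assumed page encoding:

```python
def encode_with_fallback(text, encoding, fallback="utf-8"):
    """Encode text with the declared encoding, or with the fallback if that fails.

    Hypothetical helper mirroring the try/except in extract_links;
    it is not part of Scrapy.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The declared encoding cannot represent some character
        # (for example U+2192 in latin-1), so use the fallback instead.
        return text.encode(fallback)


# Plain ASCII survives any common encoding unchanged.
ascii_bytes = encode_with_fallback("/gp/offer-listing?a=1&b=2", "latin-1")

# The arrow is not representable in latin-1, so UTF-8 is used.
arrow_bytes = encode_with_fallback("next \u2192", "latin-1")
```

Note that the fallback bytes are then in a different encoding from the one the page declared, so this hides the error rather than guaranteeing that the resulting URL is what the server expects.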
Later, I used it in my spiders.py:
rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*',),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a",)),
         callback='parse_start_url', follow=True,
         ),
)
Any example would help a lot!