
Scrapy Python link encoding: Unicode link error

When scraping a website with Scrapy, the spider extracts links containing &amp; and throws an exception:

Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)

How do I fix this error?
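For context, the message comes from Scrapy's Link class, which expects byte-string URLs. A rough illustration of the situation (the URL below is only an assumed example), where encoding the unicode URL before building the Link avoids the warning:

from scrapy.link import Link

# u'' marks a unicode string in Python 2; passing it straight to Link()
# triggers the "Assuming utf-8 encoding" warning, so encode it first.
url = u'http://www.example.com/gp/offer-listing?ie=UTF8&condition=new'  # assumed example URL
link = Link(url.encode('utf-8'))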


Any examples would help a lot! –

Answer


I ran into the same problem with this character when following certain links. I found this related commit on GitHub, then followed this advice and wrote a file link_extractors.py containing:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                # Re-encode the selected fragment with the response encoding
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                # Fall back to utf-8 when the declared encoding cannot
                # represent the extracted text
                body = body.encode('utf-8')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links

Then I used it in my spiders.py:

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*',),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a",)),
         callback='parse_start_url', follow=True,
         ),
)
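For completeness, the spider file also needs to import the custom extractor. A minimal sketch, assuming the project package is called myproject, the spider name and start URL are placeholders, and the old scrapy.contrib CrawlSpider API is in use:

from scrapy.contrib.spiders import CrawlSpider, Rule

# Assumed package name; adjust to wherever link_extractors.py lives
from myproject.link_extractors import CustomLinkExtractor


class OfferListingSpider(CrawlSpider):  # hypothetical spider
    name = 'offer_listing'
    start_urls = ['http://www.example.com/gp/offer-listing/']  # assumed start URL

    rules = (
        Rule(CustomLinkExtractor(allow=('/gp/offer-listing*',)),
             callback='parse_start_url', follow=True),
    )

    def parse_start_url(self, response):
        # Callback referenced by the Rule above; parsing logic goes here.
        pass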