
Scrapy Python link encoding: Unicode link error

When scraping a website with Scrapy, the spider extracts links containing &amp; and throws an exception:

Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)

How do I fix this error?
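For context, the message comes from Scrapy's Link class, which expects byte-string URLs. A rough illustration of the situation (the URL below is only an assumed example), where encoding the unicode URL before building the Link avoids the warning:

from scrapy.link import Link

# u'' marks a unicode string in Python 2; passing it straight to Link()
# triggers the "Assuming utf-8 encoding" warning, so encode it first.
url = u'http://www.example.com/gp/offer-listing?ie=UTF8&condition=new'  # assumed example URL
link = Link(url.encode('utf-8'))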


Any examples would help a lot! –

Answer


I ran into the same problem with this character when following certain links. I found this related commit on GitHub, then followed this advice and wrote a file link_extractors.py containing:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                # Re-encode the selected fragment with the response encoding
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                # Fall back to utf-8 when the declared encoding cannot
                # represent the extracted text
                body = body.encode('utf-8')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links

Then I used it in my spiders.py:

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*',),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a",)),
         callback='parse_start_url', follow=True,
         ),
)
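For completeness, the spider file also needs to import the custom extractor. A minimal sketch, assuming the project package is called myproject, the spider name and start URL are placeholders, and the old scrapy.contrib CrawlSpider API is in use:

from scrapy.contrib.spiders import CrawlSpider, Rule

# Assumed package name; adjust to wherever link_extractors.py lives
from myproject.link_extractors import CustomLinkExtractor


class OfferListingSpider(CrawlSpider):  # hypothetical spider
    name = 'offer_listing'
    start_urls = ['http://www.example.com/gp/offer-listing/']  # assumed start URL

    rules = (
        Rule(CustomLinkExtractor(allow=('/gp/offer-listing*',)),
             callback='parse_start_url', follow=True),
    )

    def parse_start_url(self, response):
        # Callback referenced by the Rule above; parsing logic goes here.
        pass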