2016-03-01

Scrapy - cannot follow links due to encoding

I am trying to extract some data from allabolag.se. I want to follow the links on, for example, http://www.allabolag.se/5565794400/befattningar, but Scrapy does not fetch the links correctly: it drops the "52" after the "%2" in the URL.

For example, I want to go to: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

But Scrapy ends up at the link below instead:

http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

I read on https://www.owasp.org/index.php/Double_Encoding that this has something to do with double encoding. How do I fix this?
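A quick way to see the two encoding layers, using only Python's standard library:

```python
from urllib.parse import unquote

# "%252C" is a comma encoded twice: "%25" is itself the encoding of "%".
encoded = "de_Sauvage-Nolting%252C_Henri_Jacob_Jan"
once = unquote(encoded)   # "de_Sauvage-Nolting%2C_Henri_Jacob_Jan"
twice = unquote(once)     # "de_Sauvage-Nolting,_Henri_Jacob_Jan"
print(once)
print(twice)
```

If the site deliberately serves links containing "%252C", normalizing them back to "%2C" changes which resource the server sees, which matches the failure described above.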

My code is below:

# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from allabolag.items import AllabolagItem


class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths='//*[@id="printContent"]//a[1]'),
             callback='parse_link'),
    )

    def parse_link(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagItem()
            # XPaths are relative to the selected node; note that all four
            # currently point at the same element and likely need adjusting
            item['Byra'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item
Do you get any error while crawling? – Rahul

Answer

You can configure the link extractor to not canonicalize URLs by passing canonicalize=False.

A Scrapy shell session:

$ scrapy shell http://www.allabolag.se/5565794400/befattningar 
>>> from scrapy.linkextractors import LinkExtractor 
>>> le = LinkExtractor() 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:48:07 [scrapy] DEBUG: Crawled (404) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 
>>> 

>>> le = LinkExtractor(canonicalize=False) 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:47:42 [scrapy] DEBUG: Crawled (200) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 

So you should be good with:

class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths='//*[@id="printContent"]//a[1]',
                           canonicalize=False),
             callback='parse_link'),
    )
    ...
Thanks a lot, that solved it! By the way, note that you forgot the comma after the line "restrict_xpaths='//*[@id="printContent"]//a[1]'" – brrrglund

Oh, right! Thanks. Fixed –