2016-03-01

Scrapy - cannot follow links due to encoding

I am trying to extract some data from allabolag.se. I want to follow the links on, for example, http://www.allabolag.se/5565794400/befattningar, but Scrapy does not fetch the links correctly: it drops the "52" after the "%2" in the URL.

For example, I want to go to: http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

But Scrapy ends up at the link below instead:

http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b

I read on https://www.owasp.org/index.php/Double_Encoding that this has something to do with double encoding. How do I fix this?
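A quick way to see the two encoding layers, using only Python's standard library:

```python
from urllib.parse import unquote

# "%252C" is a comma encoded twice: "%25" is itself the encoding of "%".
encoded = "de_Sauvage-Nolting%252C_Henri_Jacob_Jan"
once = unquote(encoded)   # "de_Sauvage-Nolting%2C_Henri_Jacob_Jan"
twice = unquote(once)     # "de_Sauvage-Nolting,_Henri_Jacob_Jan"
print(once)
print(twice)
```

If the site deliberately serves links containing "%252C", normalizing them back to "%2C" changes which resource the server sees, which matches the failure described above.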

My code is below:

# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from allabolag.items import AllabolagItem


class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths='//*[@id="printContent"]//a[1]'),
             callback='parse_link'),
    )

    def parse_link(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagItem()
            # XPaths are relative to the selected node; note that all four
            # currently point at the same element and likely need adjusting
            item['Byra'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item
Do you get any error while crawling? – Rahul

Answer

You can configure the link extractor to not canonicalize URLs by passing canonicalize=False.

A Scrapy shell session:

$ scrapy shell http://www.allabolag.se/5565794400/befattningar 
>>> from scrapy.linkextractors import LinkExtractor 
>>> le = LinkExtractor() 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:48:07 [scrapy] DEBUG: Crawled (404) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%2C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 
>>> 

>>> le = LinkExtractor(canonicalize=False) 
>>> for l in le.extract_links(response): 
...  print l 
... 
(...stripped...) 
Link(url='http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b', text=u'', fragment='', nofollow=False) 
(...stripped...) 
>>> 
>>> fetch('http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b') 
2016-03-02 11:47:42 [scrapy] DEBUG: Crawled (200) <GET http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b> (referer: None) 

So you should be good with:

class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths='//*[@id="printContent"]//a[1]',
                           canonicalize=False),
             callback='parse_link'),
    )
    ...
Thanks a lot, that solved it! By the way, note that you forgot the comma after the line "restrict_xpaths='//*[@id="printContent"]//a[1]'" – brrrglund

Oh, right! Thanks. Fixed –