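Scrapy - Can't follow links because of encoding
I am trying to extract some data from allabolag.se. I want to follow links such as http://www.allabolag.se/5565794400/befattningar, but Scrapy does not get the link right: the "52" that should come right after "%2" in the URL is missing.
Scrapy ends up requesting a different link from the one on the page. I read on this site that it has something to do with encoding: https://www.owasp.org/index.php/Double_Encoding
How do I get around this?
My code is as follows: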
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from allabolag.items import AllabolagItem


class allabolagspider(CrawlSpider):
    name = "allabolagspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        # Only follow links found inside the #printContent element.
        Rule(
            LinkExtractor(
                allow="http://www.allabolag.se",
                restrict_xpaths=('//*[@id="printContent"]//a[1]'),
            ),
            callback='parse_link',
        ),
    )

    def parse_link(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagItem()
            # The leading "./" keeps each XPath relative to sel;
            # a leading "/" would search from the document root.
            item['Byra'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('./div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item
Do you get any errors while crawling? – Rahul
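For reference, the double encoding mentioned in the question can be reproduced with the standard library alone. The sketch below is only an illustration: the path segment `Doe,_Jane` is a made-up example, not a URL from allabolag.se. It shows how a comma encoded twice becomes "%252C", and how decoding it once leaves "%2C", which matches the symptom of the "52" going missing after "%2".

```python
from urllib.parse import quote, unquote

# Hypothetical path segment containing a comma (not taken from the post).
raw_segment = "Doe,_Jane"

# First round of percent-encoding: "," becomes "%2C".
encoded_once = quote(raw_segment, safe="")

# Second round: the "%" itself is encoded as "%25", so "%2C" becomes "%252C".
encoded_twice = quote(encoded_once, safe="")

# Decoding once turns "%252C" back into "%2C": the "52" after "%2" is gone.
decoded_once = unquote(encoded_twice)

print(encoded_once)   # Doe%2C_Jane
print(encoded_twice)  # Doe%252C_Jane
print(decoded_once)   # Doe%2C_Jane
```

If the link extractor's URL normalisation is what collapses the sequence here (an assumption, since the post does not confirm it), the `canonicalize` and `process_value` arguments of `LinkExtractor` would be the first places to look.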