
I am trying to write my first web crawler / data extractor using Scrapy, and I can't get it to follow links. I am also getting an error:

ERROR: Spider error processing <GET https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles>

I know the spider is crawling the page at least once, because I was able to pull information out of the a tags and h1 elements I was experimenting with.

Does anyone know how I can make it follow the links on the page and get rid of the error?

import scrapy
from scrapy.linkextractors import LinkExtractor
from wikiCrawler.items import WikicrawlerItem
from scrapy.spiders import Rule


class WikispyderSpider(scrapy.Spider):
    name = "wikiSpyder"

    allowed_domains = ['https://en.wikipedia.org/']

    start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles']

    rules = (
        Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse"),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        items = []
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        for link in links:
            item = WikicrawlerItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            items.append(item)
            print(items)
        return items

Answer


If you want to use link extractors with rules, you need a special spider class: CrawlSpider.

from scrapy.spiders import CrawlSpider 

class WikispyderSpider(CrawlSpider): 
    # ... 

Here is a simple spider that follows the links from the start URL and prints out the page titles. Note two fixes: allowed_domains should contain a domain name, not a full URL, and the rule callback must not be named parse, because CrawlSpider uses parse internally to implement its own logic:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WikispyderSpider(CrawlSpider):
    name = "wikiSpyder"

    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles']

    rules = (
        Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse_link"),
    )

    def parse_link(self, response):
        print(response.xpath("//title/text()").extract_first())
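
If you also want to keep the item output from your original parse(), you can yield items from the rule callback instead of building and returning a list. Here is a minimal sketch, assuming WikicrawlerItem declares the url_from and url_to fields used in your question:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wikiCrawler.items import WikicrawlerItem


class WikispyderSpider(CrawlSpider):
    name = "wikiSpyder"

    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Wikipedia:Unusual_articles']

    rules = (
        Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse_link"),
    )

    def parse_link(self, response):
        # Record one item per outgoing link, as the original parse() did,
        # but yield each item instead of collecting them in a list.
        for link in LinkExtractor(canonicalize=True, unique=True).extract_links(response):
            item = WikicrawlerItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            yield item

You can then export the items with Scrapy's built-in feed export, e.g. scrapy crawl wikiSpyder -o links.json.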

You are amazing, thank you for the help! – Asuu