
I am building a simple scraper to get 9gag posts and their images, but due to some technical difficulties I cannot stop the scraper, and it keeps crawling, which I don't want. I want to increment a counter value and stop after 100 posts. However, the 9gag page is designed so that each response only returns 10 posts, and after every iteration my counter value resets to 10. In that case my loop runs infinitely and never stops. How do I stop a Scrapy spider after a certain number of requests?


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if count != 100:
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count

The code for items.py is here:

from scrapy.item import Item, Field 


class GagItem(Item): 
    entry_id = Field() 
    url = Field() 
    votes = Field() 
    comments = Field() 
    title = Field() 
    img_url = Field() 

So I wanted to keep a global count value and tried passing a third argument to the parse function, which raises this error:

TypeError: parse() takes exactly 3 arguments (2 given) 

So, is there a way to pass a global count value, have it carried over after every iteration, and stop after (say) 100 posts?
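(For reference, one common way to carry a running total between callbacks is Scrapy's Request.meta. A minimal sketch, separate from the project above; the spider name, the post_count key and the 100-post cutoff are illustrative only:)

import scrapy

class CountingSpider(scrapy.Spider):
    # Hypothetical spider used only for this sketch.
    name = "counting_example"
    start_urls = ['http://www.9gag.com/']
    POST_LIMIT = 100  # assumed cutoff

    def parse(self, response):
        # Read the running total carried over from the previous request.
        count = response.meta.get('post_count', 0)
        last_gag_id = None
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract_first()
            if gag_id:
                last_gag_id = gag_id
                count += 1
                yield {'entry_id': gag_id}
        if last_gag_id and count < self.POST_LIMIT:
            next_url = 'http://9gag.com/?id=%s&c=10' % last_gag_id
            # Hand the updated total to the next callback via meta.
            yield scrapy.Request(next_url, callback=self.parse,
                                 meta={'post_count': count})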

The whole project can be found here: Github. The infinite loop happens even if I set POST_LIMIT = 100; see the command I execute here:

scrapy crawl first -s POST_LIMIT=10 --output=output.json 
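(One thing to note: POST_LIMIT is not a setting Scrapy knows about by itself; the spider has to read it and enforce the limit. A minimal sketch of reading a -s setting via self.settings, with a hypothetical spider name and a default of 100 assumed:)

import scrapy

class LimitedSpider(scrapy.Spider):
    # Hypothetical spider used only to show reading a -s setting.
    name = "limited_example"
    start_urls = ['http://www.9gag.com/']
    count = 0

    def parse(self, response):
        # 'scrapy crawl limited_example -s POST_LIMIT=100' ends up in self.settings.
        limit = self.settings.getint('POST_LIMIT', 100)
        for article in response.xpath('//article'):
            if self.count >= limit:
                return  # stop yielding items and follow-up requests
            self.count += 1
            yield {'entry_id': article.xpath('@data-entry-id').extract_first()}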

Answers

4

First: use self.count and initialize it outside of parse. Then don't block the parsing of the items; instead, control the crawl by only generating the next request while the count is below the limit. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/',)

    last_gag_id = None
    COUNT_MAX = 30
    count = 0

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            # The counter lives on the spider instance, so it survives between callbacks.
            self.count = self.count + 1
            yield ninegag_item

        # Only request the next page while the total is below the cap.
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)

Is there a way to find out when the scraping has finished? –


Works very well, thanks @Frank –
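(On the comment above about detecting when scraping is finished: one option is the spider's closed() hook, which Scrapy calls once the spider finishes. A minimal sketch with a hypothetical spider name; whether this is enough depends on what should happen at the end:)

import scrapy

class NotifyWhenDoneSpider(scrapy.Spider):
    # Hypothetical spider used only to show the closed() hook.
    name = "notify_example"
    start_urls = ['http://www.9gag.com/']

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        # Called once when the spider finishes; reason is e.g. 'finished'
        # or 'closespider_pagecount'.
        self.logger.info("Scraping done, reason: %s", reason)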

0

count is local to the parse() method, so it is not preserved between pages. Change every occurrence of count to self.count to make it an instance variable of the class, and it will persist between pages.

0

Spider arguments are passed using the -a option. Check the link.
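(A minimal sketch of that approach, assuming an argument named limit chosen for illustration: values passed with -a become keyword arguments of the spider's constructor and arrive as strings.)

import scrapy

class ArgLimitSpider(scrapy.Spider):
    # Hypothetical spider used only to show a -a spider argument.
    name = "arg_example"
    start_urls = ['http://www.9gag.com/']

    def __init__(self, limit=100, *args, **kwargs):
        super(ArgLimitSpider, self).__init__(*args, **kwargs)
        self.limit = int(limit)  # -a values come in as strings
        self.count = 0

    def parse(self, response):
        for article in response.xpath('//article'):
            if self.count >= self.limit:
                return
            self.count += 1
            yield {'entry_id': article.xpath('@data-entry-id').extract_first()}

It would then be run as, for example: scrapy crawl arg_example -a limit=100 -o output.json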

2

There is a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in settings: scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

A small caveat: if you have caching enabled, it will count cache hits as well.
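(The same idea can also be baked into the spider via the custom_settings class attribute. Since the goal here is a fixed number of posts rather than pages, CLOSESPIDER_ITEMCOUNT, which closes the spider after a given number of scraped items, may be the closer fit. A minimal sketch; the spider name and the limits are illustrative:)

import scrapy

class CappedSpider(scrapy.Spider):
    # Hypothetical spider; the limits below are illustrative.
    name = "capped_example"
    start_urls = ['http://www.9gag.com/']

    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 100,  # stop after ~100 scraped items
        'CLOSESPIDER_PAGECOUNT': 20,   # or after 20 downloaded pages
    }

    def parse(self, response):
        for article in response.xpath('//article'):
            yield {'entry_id': article.xpath('@data-entry-id').extract_first()}

Note that the CloseSpider extension shuts the crawl down gracefully, so requests already in flight can still produce a few items beyond the limit.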