
I am building a simple scraper to get 9gag posts and their images, but due to some technical difficulties I cannot stop the scraper, and it keeps crawling, which I don't want. I want to increment a counter value and stop after 100 posts. However, the 9gag page is designed so that each response only returns 10 posts, and after every iteration my counter value resets to 10. In that case my loop runs infinitely and never stops. How do I stop a Scrapy spider after a certain number of requests?


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if count != 100:
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count

The code for items.py is here:

from scrapy.item import Item, Field 


class GagItem(Item): 
    entry_id = Field() 
    url = Field() 
    votes = Field() 
    comments = Field() 
    title = Field() 
    img_url = Field() 

So I wanted to keep a global count value and tried passing a third argument to the parse function, which raises this error:

TypeError: parse() takes exactly 3 arguments (2 given) 

So, is there a way to pass a global count value, have it carried over after every iteration, and stop after (say) 100 posts?
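(For reference, one common way to carry a running total between callbacks is Scrapy's Request.meta. A minimal sketch, separate from the project above; the spider name, the post_count key and the 100-post cutoff are illustrative only:)

import scrapy

class CountingSpider(scrapy.Spider):
    # Hypothetical spider used only for this sketch.
    name = "counting_example"
    start_urls = ['http://www.9gag.com/']
    POST_LIMIT = 100  # assumed cutoff

    def parse(self, response):
        # Read the running total carried over from the previous request.
        count = response.meta.get('post_count', 0)
        last_gag_id = None
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract_first()
            if gag_id:
                last_gag_id = gag_id
                count += 1
                yield {'entry_id': gag_id}
        if last_gag_id and count < self.POST_LIMIT:
            next_url = 'http://9gag.com/?id=%s&c=10' % last_gag_id
            # Hand the updated total to the next callback via meta.
            yield scrapy.Request(next_url, callback=self.parse,
                                 meta={'post_count': count})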

The whole project can be found here: Github. The infinite loop happens even if I set POST_LIMIT = 100; see the command I execute here:

scrapy crawl first -s POST_LIMIT=10 --output=output.json 
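(One thing to note: POST_LIMIT is not a setting Scrapy knows about by itself; the spider has to read it and enforce the limit. A minimal sketch of reading a -s setting via self.settings, with a hypothetical spider name and a default of 100 assumed:)

import scrapy

class LimitedSpider(scrapy.Spider):
    # Hypothetical spider used only to show reading a -s setting.
    name = "limited_example"
    start_urls = ['http://www.9gag.com/']
    count = 0

    def parse(self, response):
        # 'scrapy crawl limited_example -s POST_LIMIT=100' ends up in self.settings.
        limit = self.settings.getint('POST_LIMIT', 100)
        for article in response.xpath('//article'):
            if self.count >= limit:
                return  # stop yielding items and follow-up requests
            self.count += 1
            yield {'entry_id': article.xpath('@data-entry-id').extract_first()}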

Answers

4

First: use self.count and initialize it outside of parse. Then don't block the parsing of the items; instead, control the crawl by only generating the next request while the count is below the limit. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/',)

    last_gag_id = None
    COUNT_MAX = 30
    count = 0

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            # The counter lives on the spider instance, so it survives between callbacks.
            self.count = self.count + 1
            yield ninegag_item

        # Only request the next page while the total is below the cap.
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)

Is there a way to find out when the scraping has finished? –


Works very well, thanks @Frank –
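(On the comment above about detecting when scraping is finished: one option is the spider's closed() hook, which Scrapy calls once the spider finishes. A minimal sketch with a hypothetical spider name; whether this is enough depends on what should happen at the end:)

import scrapy

class NotifyWhenDoneSpider(scrapy.Spider):
    # Hypothetical spider used only to show the closed() hook.
    name = "notify_example"
    start_urls = ['http://www.9gag.com/']

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        # Called once when the spider finishes; reason is e.g. 'finished'
        # or 'closespider_pagecount'.
        self.logger.info("Scraping done, reason: %s", reason)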

0

count is local to the parse() method, so it is not preserved between pages. Change every occurrence of count to self.count to make it an instance variable of the class, and it will persist between pages.

0

Spider arguments are passed using the -a option. Check the link.
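(A minimal sketch of that approach, assuming an argument named limit chosen for illustration: values passed with -a become keyword arguments of the spider's constructor and arrive as strings.)

import scrapy

class ArgLimitSpider(scrapy.Spider):
    # Hypothetical spider used only to show a -a spider argument.
    name = "arg_example"
    start_urls = ['http://www.9gag.com/']

    def __init__(self, limit=100, *args, **kwargs):
        super(ArgLimitSpider, self).__init__(*args, **kwargs)
        self.limit = int(limit)  # -a values come in as strings
        self.count = 0

    def parse(self, response):
        for article in response.xpath('//article'):
            if self.count >= self.limit:
                return
            self.count += 1
            yield {'entry_id': article.xpath('@data-entry-id').extract_first()}

It would then be run as, for example: scrapy crawl arg_example -a limit=100 -o output.json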

2

There is a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in settings: scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

A small caveat: if you have caching enabled, it will count cache hits as well.
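(The same idea can also be baked into the spider via the custom_settings class attribute. Since the goal here is a fixed number of posts rather than pages, CLOSESPIDER_ITEMCOUNT, which closes the spider after a given number of scraped items, may be the closer fit. A minimal sketch; the spider name and the limits are illustrative:)

import scrapy

class CappedSpider(scrapy.Spider):
    # Hypothetical spider; the limits below are illustrative.
    name = "capped_example"
    start_urls = ['http://www.9gag.com/']

    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 100,  # stop after ~100 scraped items
        'CLOSESPIDER_PAGECOUNT': 20,   # or after 20 downloaded pages
    }

    def parse(self, response):
        for article in response.xpath('//article'):
            yield {'entry_id': article.xpath('@data-entry-id').extract_first()}

Note that the CloseSpider extension shuts the crawl down gracefully, so requests already in flight can still produce a few items beyond the limit.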