
Watching the Scrapy stats line (Crawled X pages (at X pages/min)), it looks to me as if CONCURRENT_REQUESTS is ignored once DOWNLOAD_DELAY is set. For example, with

DOWNLOAD_DELAY = 4.5 

the requests become sequential, no matter what CONCURRENT_REQUESTS is set to.

From my understanding, shouldn't the delay be applied per concurrent request, or am I misunderstanding the Scrapy architecture? So in my example, shouldn't

scrapy crawl us_al -a cid_range=000001..000020 

run faster with 10 concurrent requests, instead of taking the roughly 1 minute 50 seconds it actually takes (RANDOMIZE_DOWNLOAD_DELAY left at its default)? How would I change this behaviour? Without DOWNLOAD_DELAY, querying 20 items takes 4 seconds with CONCURRENT_REQUESTS = 5 and 10 seconds with CONCURRENT_REQUESTS = 1, which is the behaviour that makes more sense to me.

Here is what the spider looks like:

import random 
import re 
import scrapy 

class UsAlSpider(scrapy.Spider):
    name = "us_al"
    allowed_domains = ["arc-sos.state.al.us"]
    start_urls = []
    custom_settings = {
        'CONCURRENT_REQUESTS': 10,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'DOWNLOAD_DELAY': 4.5
    }

    def __init__(self, cid_range=None, *args, **kwargs):
        """
        Range (in the form: 000001..000010)
        """
        super(UsAlSpider, self).__init__(*args, **kwargs)
        self.cid_range = cid_range

    def start_requests(self):
        if self.cid_range and not re.search(r'^\d+\.\.\d+$', self.cid_range):
            self.logger.error('Check input parameter cid_range={} needs to be in form cid_range=000001..000010'.format(self.cid_range))
            return
        # crawl according to input option
        id_range = self.cid_range.split('..')
        shuffled_ids = ["{0:06}".format(i) for i in xrange(
            int(id_range[0]), int(id_range[1]) + 1)]
        random.shuffle(shuffled_ids)
        for id_ in shuffled_ids:
            yield self.make_requests_from_url('http://arc-sos.state.al.us/cgi/corpdetail.mbr/detail?corp={}'.format(id_))

    def parse(self, response):
        # parse the page info
        pass
Answer


CONCURRENT_REQUESTS is just a global cap on in-flight requests, so as long as you also use the other settings (which are usually enforced per domain), there is no problem with setting CONCURRENT_REQUESTS high.

DOWNLOAD_DELAY is applied per domain, and that is intentional: the idea behind it is not to hammer one particular site. It also overrides CONCURRENT_REQUESTS_PER_DOMAIN, so in effect DOWNLOAD_DELAY > 0 -> CONCURRENT_REQUESTS_PER_DOMAIN = 1.
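In other words, with DOWNLOAD_DELAY = 4.5 the requests to that domain are spaced roughly 4.5 s apart (randomized between about 0.5x and 1.5x of that by RANDOMIZE_DOWNLOAD_DELAY), so 20 requests take about 20 × 4.5 ≈ 90 s plus response time, which lines up with the ~1 min 50 s you are seeing. If you want to stay polite to the site but still get some concurrency, one option is to drop the fixed delay and let AutoThrottle pace the requests instead. A minimal sketch of the spider's custom_settings, assuming a Scrapy version that has AUTOTHROTTLE_TARGET_CONCURRENCY (1.1+):

custom_settings = {
    'DOWNLOAD_DELAY': 0,                     # no fixed per-slot delay
    'CONCURRENT_REQUESTS': 10,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
    'AUTOTHROTTLE_ENABLED': True,            # adapt the delay from observed latencies
    'AUTOTHROTTLE_START_DELAY': 1.0,
    'AUTOTHROTTLE_MAX_DELAY': 10.0,
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 2.0,  # aim for ~2 requests in flight against the site
}

Alternatively, keep DOWNLOAD_DELAY but lower it to the request rate you actually want; with a fixed delay the total runtime stays roughly number_of_requests × DOWNLOAD_DELAY, regardless of the CONCURRENT_REQUESTS* settings.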


So does 'CONCURRENT_REQUESTS' apply across all spiders, as a general setting for the whole Scrapy framework? – MrKaikev


Yes, exactly. – eLRuLL