Adding a delay after every 500 requests in Scrapy

I start with a list of 2000 URLs, and I am using:

DOWNLOAD_DELAY = 0.25

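to control the request rate, but I also want to add a longer delay after every N requests. For example, I want a delay of 0.25 seconds between requests and a delay of 100 seconds after every 500 requests.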

Edit:

Sample code:

import os
from os.path import join
import scrapy
import time

# Date stamp used as the output folder name, e.g. 31_07_2016
date = time.strftime("%d/%m/%Y").replace('/', '_')

# Maps each start URL to the file name its home page is saved under
list_of_pages = {'http://www.lapatilla.com/site/': 'la_patilla',
                 'http://runrun.es/': 'runrunes',
                 'http://www.noticierodigital.com/': 'noticiero_digital',
                 'http://www.eluniversal.com/': 'el_universal',
                 'http://www.el-nacional.com/': 'el_nacional',
                 'http://globovision.com/': 'globovision',
                 'http://www.talcualdigital.com/': 'talcualdigital',
                 'http://www.maduradas.com/': 'maduradas',
                 'http://laiguana.tv/': 'laiguana',
                 'http://www.aporrea.org/': 'aporrea'}

root_dir = os.getcwd()
output_dir = join(root_dir, 'data', date)

class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1  # seconds between consecutive requests

    # list() so that start_urls is a real list on Python 3 as well
    start_urls = list(list_of_pages.keys())

    def parse(self, response):
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        filename = list_of_pages[response.url]
        print(time.time())  # timestamp, to check the spacing between requests
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)

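The list is shorter in this case, but the idea is the same. I want one delay level for every request and another delay level for every N requests. I am not crawling links, only saving the main pages.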

Source

2016-07-31 Luis Ramon Ramirez Rodriguez

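To help you, we will need some more code.

Please post a [mcve] if you want help finding a good approach; otherwise this question is far too broad for SO.

@DavidGomes Added the code.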

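You could look into the AutoThrottle extension. It does not give you strict control over the delays; instead it uses its own algorithm, which slows the spider down based on response times and the number of concurrent requests.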
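As a rough illustration, AutoThrottle is switched on through the project settings. The setting names below are Scrapy's own; the values are only assumptions chosen to resemble the 0.25 second base delay from the question, not recommendations.

```python
# settings.py (sketch; the values are illustrative assumptions)

# Lower bound: AutoThrottle does not go below the regular download delay.
DOWNLOAD_DELAY = 0.25

AUTOTHROTTLE_ENABLED = True
# Delay used for the first requests, before any latency has been measured.
AUTOTHROTTLE_START_DELAY = 1.0
# Upper bound for the delay when the remote server responds slowly.
AUTOTHROTTLE_MAX_DELAY = 100.0
# Average number of parallel requests to aim for per remote server.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Set to True to log the throttling decision for every response.
AUTOTHROTTLE_DEBUG = False
```

This does not produce a fixed 100 second pause after every 500 requests; it only adapts the per-request delay to how the server behaves.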

If you need more control over the delays at certain stages of the scraping process, you may need a custom middleware or a custom extension (similar to AutoThrottle - source).
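A minimal sketch of the custom middleware idea, under several assumptions: the class name `PauseEveryNMiddleware` and the settings `BATCH_PAUSE_EVERY` and `BATCH_PAUSE_SECONDS` are hypothetical, and the pause is a plain blocking sleep. A blocking sleep stalls the whole crawler, including downloads already in flight, so it is a blunt tool that only fits when a complete stop is acceptable.

```python
import time


class PauseEveryNMiddleware:
    """Hypothetical downloader middleware: long pause after every N requests."""

    def __init__(self, batch_size=500, pause_seconds=100.0, sleep=time.sleep):
        self.batch_size = batch_size
        self.pause_seconds = pause_seconds
        self.sleep = sleep  # injectable so the logic can be tested without waiting
        self.seen = 0  # requests passed towards the downloader so far

    @classmethod
    def from_crawler(cls, crawler):
        # BATCH_PAUSE_EVERY and BATCH_PAUSE_SECONDS are made-up setting names.
        return cls(
            batch_size=crawler.settings.getint("BATCH_PAUSE_EVERY", 500),
            pause_seconds=crawler.settings.getfloat("BATCH_PAUSE_SECONDS", 100.0),
        )

    def process_request(self, request, spider):
        # Pause before request 501, 1001, ... (with the default batch size).
        if self.seen and self.seen % self.batch_size == 0:
            # Blocking call: freezes the whole crawler, not just this request.
            self.sleep(self.pause_seconds)
        self.seen += 1
        return None  # let Scrapy continue handling the request normally
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` under whatever module path it lives in, and the regular `DOWNLOAD_DELAY = 0.25` would keep handling the per-request spacing. Note that retried and redirected requests also pass through `process_request`, so the counter tracks requests sent to the downloader rather than distinct URLs.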

You can also change the .download_delay attribute of your spider at run time. By the way, this is exactly what the AutoThrottle extension does: it updates the .download_delay value on the fly.

Some related topics:

Source

2016-07-31 21:03:43 alecxe
