
Scrapy delay request

Every time I run my code, my IP gets banned. I need help delaying each request by 10 seconds. I tried putting DOWNLOAD_DELAY in the code, but with no result. Any help is appreciated.

import re

import scrapy


# item class included here
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "https://washingtondc.craigslist.org/search/fua"
    ]

    BASE_URL = 'https://washingtondc.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            url = self.BASE_URL + "reply/nos/vgm/" + item_id

            item = DmozItem()
            item["link"] = response.url

            return scrapy.Request(url, meta={'item': item}, callback=self.parse_contact)

    def parse_contact(self, response):
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item

time.sleep(10) before your requests – Ajay


Where exactly should I put the time.sleep() to try this? –


Probably after this line, I guess: absolute_url = self.BASE_URL + link – Ajay
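
A minimal sketch of the placement suggested in these comments (the time.sleep call is the commenters' workaround, not part of the original spider; note that time.sleep() blocks Scrapy's Twisted reactor, which is why the answer below relies on DOWNLOAD_DELAY instead):

import time

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://washingtondc.craigslist.org/search/fua"]
    BASE_URL = 'https://washingtondc.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            # Suggested placement: pause before yielding each request.
            # Caveat: time.sleep() blocks the whole reactor, so the
            # DOWNLOAD_DELAY setting (see the answer below) is the idiomatic fix.
            time.sleep(10)
            yield scrapy.Request(absolute_url, callback=self.parse_attr)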

Answer


You need to set DOWNLOAD_DELAY in your project's settings.py. Note that you may also need to limit concurrency. By default concurrency is 8, so you are hitting the website with 8 concurrent requests.

# settings.py 
DOWNLOAD_DELAY = 1 
CONCURRENT_REQUESTS_PER_DOMAIN = 2 

Starting with Scrapy 1.0 you can also place custom settings in the spider, so you could do something like this:

class DmozSpider(Spider): 
    name = "dmoz" 
    allowed_domains = ["dmoz.org"] 
    start_urls = [ 
     "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", 
     "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/", 
    ] 

    custom_settings = { 
     "DOWNLOAD_DELAY": 5, 
     "CONCURRENT_REQUESTS_PER_DOMAIN": 2 
    } 

Delays and concurrency are set per download slot, not per request. To actually check what download slot you have, you can try something like this:

def parse(self, response): 
    # Look up the download slot for this domain to see the effective settings. 
    delay = self.crawler.engine.downloader.slots["www.dmoz.org"].delay 
    concurrency = self.crawler.engine.downloader.slots["www.dmoz.org"].concurrency 
    self.log("Delay {}, concurrency {} for request {}".format(delay, concurrency, response.request)) 
    return 

Just a note that even in version 0.24 you can configure download_delay per spider, as described in the URL you linked to: 'You can also change this setting per spider by setting the download_delay spider attribute.' – bosnjak
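
For reference, a minimal sketch of the per-spider attribute that comment refers to (spider name and URLs reused from the answer above for illustration):

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    # Per-spider delay in seconds; per the comment above this works in 0.24 too,
    # without needing the custom_settings dict introduced in Scrapy 1.0.
    download_delay = 5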