2017-10-11

I want to repeatedly fetch the same URL with different delays. After researching the problem, it seemed that the appropriate solution was to use something like a deferred request in Scrapy:

nextreq = scrapy.Request(url, dont_filter=True) 
d = defer.Deferred() 
delay = 1 
reactor.callLater(delay, d.callback, nextreq) 
yield d 
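For context, reactor.callLater(delay, f, *args) simply schedules f(*args) to run after delay seconds on Twisted's event loop. A minimal stdlib analogue of that fire-after-delay behaviour (using threading.Timer, purely to illustrate the semantics; not part of Scrapy or Twisted):

```python
import threading

fired = []

# schedule fired.append("request") to run roughly 0.1 s from now
t = threading.Timer(0.1, fired.append, args=["request"])
t.start()

print(fired)   # [] -- the callback has not run yet
t.join()       # block until the timer thread finishes
print(fired)   # ['request'] -- the callback ran after the delay
```

The key point in both cases is that scheduling returns immediately; the callback runs later, which is why yielding the raw Deferred from parse does not fit Scrapy's expected return types.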

inside the parse callback.

However, I have been unable to make this work. I get the error message ERROR: Spider must return Request, BaseItem, dict or None, got 'Deferred'.

I am not familiar with Twisted, so I am hoping I am just missing something obvious.

Is there a better way to achieve my goal that does not fight the framework so much?

Answer

I finally found the answer in an old PR:

from twisted.internet import reactor

def parse(self, response): 
    req = scrapy.Request(...) 
    delay = 0 
    reactor.callLater(delay, self.crawler.engine.schedule, request=req, spider=self) 

However, the spider can then exit prematurely because it goes idle. Based on the outdated middleware https://github.com/ArturGaspar/scrapy-delayed-requests, this can be remedied with:

from scrapy import signals 
from scrapy.exceptions import DontCloseSpider 

class ImmortalSpiderMiddleware(object): 

    @classmethod 
    def from_crawler(cls, crawler): 
        s = cls() 
        crawler.signals.connect(s.spider_idle, signal=signals.spider_idle) 
        return s 

    @classmethod 
    def spider_idle(cls, spider): 
        raise DontCloseSpider() 

Finally, addressing the remaining issue by updating ArturGaspar's middleware led to:

from weakref import WeakKeyDictionary 

from scrapy import signals 
from scrapy.exceptions import DontCloseSpider, IgnoreRequest 
from twisted.internet import reactor 

class DelayedRequestsMiddleware(object): 
    requests = WeakKeyDictionary() 

    @classmethod 
    def from_crawler(cls, crawler): 
        ext = cls() 
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle) 
        return ext 

    @classmethod 
    def spider_idle(cls, spider): 
        if cls.requests.get(spider): 
            spider.log("delayed requests pending, not closing spider") 
            raise DontCloseSpider() 

    def process_request(self, request, spider): 
        delay = request.meta.pop('delay_request', None) 
        if delay: 
            self.requests.setdefault(spider, 0) 
            self.requests[spider] += 1 
            reactor.callLater(delay, self.schedule_request, request.copy(), 
                              spider) 
            raise IgnoreRequest() 

    def schedule_request(self, request, spider): 
        spider.crawler.engine.schedule(request, spider) 
        self.requests[spider] -= 1 

which can then be used in parse like:

yield Request(..., meta={'delay_request': 5})
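For the middleware to take effect it also has to be enabled in the project settings. A sketch of the settings entry, assuming the class lives in myproject/middlewares.py (the module path and the priority value 543 are placeholders, not from the original answer):

```python
# settings.py -- enable the middleware; the module path and the
# priority value 543 are placeholders, adjust to your project layout
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DelayedRequestsMiddleware': 543,
}
```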