2015-09-10

Scrapy - retrieving the spider object in the dupefilter

This is the `request_seen` method of Scrapy's default dupefilter class:

class RFPDupeFilter(BaseDupeFilter):

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
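For intuition, the behaviour of `request_seen` above can be sketched without Scrapy at all. This is a toy stand-in (the class name is hypothetical, and the URL itself stands in for the real request fingerprint):

```python
class TinyDupeFilter:
    """Minimal sketch of RFPDupeFilter.request_seen: a request is
    'seen' if its fingerprint is already in the set."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        if url in self.fingerprints:
            return True        # duplicate: the scheduler drops it
        self.fingerprints.add(url)
        return None            # first sighting: request goes through
```

The real filter additionally appends each new fingerprint to a file so the seen-set survives a pause/resume cycle.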

While implementing a custom dupefilter, I cannot retrieve the spider object from inside this class, unlike in other Scrapy middlewares.

Is there any way to know which spider object the request belongs to, so I can customize the filter on a per-spider basis?

Also, I can't just implement a middleware that reads URLs, puts them in a list, and checks for duplicates instead of a custom dupefilter. This is because I need to pause/resume crawls, and I need Scrapy to persist the request fingerprints to disk by default, using the JOBDIR setting.
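For context, the JOBDIR-based pause/resume mentioned above is typically driven from the command line (the spider name `somespider` and job directory below are placeholders):

```shell
# Start a crawl, persisting scheduler state (including the dupefilter's
# seen fingerprints) to the given directory
scrapy crawl somespider -s JOBDIR=crawls/somespider-1

# Stop the crawl gracefully with a single Ctrl-C, then re-run the exact
# same command to resume where it left off
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```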

Answer


If you really want this, one solution is to override the signature of RFPDupeFilter's `request_seen` method so that it receives two arguments, `(self, request, spider)`. Since `request_seen` is called inside the Scrapy Scheduler's `enqueue_request` method, you then need to override that as well. You can create the new scheduler and new dupefilter like this:

# /scheduler.py

from scrapy.core.scheduler import Scheduler


class MyScheduler(Scheduler):

    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request, self.spider):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True

-

# /dupefilters.py

import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

        # Do things with spider

Then set their paths in settings.py:

# /settings.py 

DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter' 
SCHEDULER = 'myproject.scheduler.MyScheduler'
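Once the spider reaches `request_seen`, the per-spider logic itself is plain Python. A toy, Scrapy-free sketch (the class and all names are hypothetical) of one possible use: namespacing fingerprints by spider name so that different spiders don't filter out each other's requests:

```python
import hashlib


class PerSpiderSeenFilter:
    """Toy illustration of spider-aware duplicate filtering: the
    fingerprint is keyed by spider name, so the same URL counts as
    'seen' only within one spider."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url, spider_name):
        # Prefix the URL with the spider name before hashing, so
        # each spider gets its own fingerprint namespace.
        fp = hashlib.sha1((spider_name + "|" + url).encode()).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

In the real `MyRFPDupeFilter` you would apply the same idea using `spider.name` (or any other spider attribute) inside `request_seen`.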