2016-10-03 72 views

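I'm using a Bloom filter as the duplicates filter in Scrapy. After about 10 minutes of crawling, the following error, "BloomFilter is at capacity", starts repeating in a loop: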

2016-10-03 18:03:34 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/task.py", line 517, in _oneWorkUnit
    result = next(self._iterator)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 63, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
    self.crawler.engine.crawl(request=output, spider=spider)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 209, in crawl
    self.schedule(request, spider)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 215, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scheduler.py", line 54, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
  File "dirbot/custom_filters.py", line 20, in request_seen
    self.fingerprints.add(fp)
  File "/usr/local/lib/python2.7/dist-packages/pybloom/pybloom.py", line 182, in add
    raise IndexError("BloomFilter is at capacity")
IndexError: BloomFilter is at capacity

The filter.py:

from pybloom import BloomFilter
from scrapy.utils.job import job_dir
from scrapy.dupefilters import BaseDupeFilter

class BLOOMDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None):
        self.file = None
        self.fingerprints = BloomFilter(2000000, 0.00001)

    @classmethod
    def from_settings(cls, settings):
        return cls(job_dir(settings))

    def request_seen(self, request):
        fp = request.url
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)

    def close(self, reason):
        self.fingerprints = None

I've searched Google for every possibility I could think of, but nothing worked.
Thanks for your help.

Answer

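Use pybloom.ScalableBloomFilter instead of BloomFilter. A plain BloomFilter has a fixed capacity (2,000,000 elements in your code) and raises IndexError once it is full, whereas ScalableBloomFilter adds capacity as needed: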

from pybloom import ScalableBloomFilter
from scrapy.utils.job import job_dir
from scrapy.dupefilters import BaseDupeFilter

class BLOOMDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self,
                 path=None,
                 initial_capacity=2000000,
                 error_rate=0.00001,
                 mode=ScalableBloomFilter.SMALL_SET_GROWTH):
        self.file = None
        # Grows as elements are added instead of raising IndexError when full.
        self.fingerprints = ScalableBloomFilter(
            initial_capacity, error_rate, mode)
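
To see why the swap helps, here is a rough, standard-library-only sketch of the difference between a fixed-capacity filter and one that grows. It is a toy illustration and not pybloom's actual implementation: the class names FixedBloomFilter and GrowingBloomFilter, the 64-bits-per-element sizing, and the salted SHA-256 hashing are all assumptions made for this example.

```python
import hashlib


class FixedBloomFilter(object):
    """Toy fixed-capacity Bloom filter (hypothetical, not pybloom's code)."""

    def __init__(self, capacity, num_hashes=4):
        self.capacity = capacity
        self.num_hashes = num_hashes
        # Assumed sizing for the sketch: 64 bits per expected element.
        self.num_bits = capacity * 64
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            salted = ("%d:%s" % (i, key)).encode("utf-8")
            yield int(hashlib.sha256(salted).hexdigest(), 16) % self.num_bits

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

    def add(self, key):
        # A fixed filter refuses new elements once it is full.
        if self.count >= self.capacity:
            raise IndexError("filter is at capacity")
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1


class GrowingBloomFilter(object):
    """Chains fixed filters and opens a larger one when the newest is full."""

    def __init__(self, initial_capacity, growth=2):
        self.growth = growth
        self.filters = [FixedBloomFilter(initial_capacity)]

    def __contains__(self, key):
        return any(key in f for f in self.filters)

    def add(self, key):
        if key in self:
            return True  # already (probably) seen
        newest = self.filters[-1]
        if newest.count >= newest.capacity:
            newest = FixedBloomFilter(newest.capacity * self.growth)
            self.filters.append(newest)
        newest.add(key)
        return False


urls = ["http://example.com/page/%d" % n for n in range(10)]

# The fixed filter fails on the fifth URL, like the traceback in the question.
fixed = FixedBloomFilter(capacity=4)
overflowed = False
try:
    for url in urls:
        fixed.add(url)
except IndexError:
    overflowed = True

# The growing filter accepts all ten URLs by adding a second, larger stage.
growing = GrowingBloomFilter(initial_capacity=4)
for url in urls:
    growing.add(url)
```

In a long crawl the number of distinct URLs is hard to predict up front, so a filter that grows trades some extra memory and lookup time for not having to guess the capacity.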

Do I need to add @classmethod like in my previous code? – Pixel

@Pixel, add whatever you want! – skovorodkin