2017-04-21

Scrapy: limit on start_urls

I would like to know whether there is a limit on how many start_urls I can assign to my spider. As far as I have searched, the documentation does not seem to mention any limit on the size of the list.

Currently I have set up my spider to read the list of start_urls from a csv file. The number of sites is around 1,000,000.
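For reference, a minimal sketch of the kind of setup being described, loading start_urls from a CSV file; the file name urls.csv and the one-URL-per-row layout are assumptions, not part of the original post:

import csv

from scrapy import Spider


class MySpider(Spider):
    name = "spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # urls.csv is a hypothetical file name; one URL per row is assumed
        with open("urls.csv", newline="") as f:
            self.start_urls = [row[0] for row in csv.reader(f) if row]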

Answer


There is no limit per se, but you will probably want to limit it yourself, otherwise you might end up with memory issues.
What can happen is that all of those 1M urls get scheduled to the scrapy scheduler, and since python objects are quite a bit heavier than plain strings, you will end up running out of memory.

To avoid this, you can batch your start urls with the spider_idle signal:

import logging

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class MySpider(Spider):
    name = "spider"
    batch_size = 10000

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # re-schedule a batch every time the spider runs out of requests
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls = []  # read from file

    def start_requests(self):
        # yield at most batch_size requests at a time
        for _ in range(min(self.batch_size, len(self.urls))):
            yield Request(self.urls.pop(0))

    def parse(self, response):
        pass
        # parse

    def idle_consume(self):
        """
        Every time the spider is about to close, check our urls
        buffer to see if there is anything left to crawl.
        """
        reqs = list(self.start_requests())
        if not reqs:
            return
        logging.info('Consuming batch')
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
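With this batching in place, only batch_size requests sit in the scheduler at any one time. Once they have all been processed the engine becomes idle, the spider_idle signal fires, idle_consume schedules the next batch and raises DontCloseSpider to keep the spider running. When self.urls is finally exhausted, start_requests yields nothing, idle_consume returns without raising, and the spider is allowed to close normally. The self.urls list itself can be filled in __init__ from the csv file, as in the sketch above.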

Thanks Granitosaurus, the help is much appreciated :) – Taku