2016-03-07

I'm working on a Scrapy project and I want to run multiple spiders at the same time. Below is the script I use to run the spiders, but I get an error. How do I run multiple spiders from a Scrapy script?

from spiders.DmozSpider import DmozSpider 
from spiders.CraigslistSpider import CraigslistSpider 

from scrapy import signals, log 
from twisted.internet import reactor 
from scrapy.crawler import Crawler 
from scrapy.settings import Settings 

TO_CRAWL = [DmozSpider, CraigslistSpider] 

RUNNING_CRAWLERS = [] 

def spider_closing(spider): 
    """Activates on spider closed signal""" 
    log.msg("Spider closed: %s" % spider, level=log.INFO) 
    RUNNING_CRAWLERS.remove(spider) 
    if not RUNNING_CRAWLERS: 
        reactor.stop() 

log.start(loglevel=log.DEBUG) 
for spider in TO_CRAWL: 
    settings = Settings() 

    # crawl responsibly 
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)") 
    crawler = Crawler(settings) 
    crawler_obj = spider() 
    RUNNING_CRAWLERS.append(crawler_obj) 

    # stop reactor when spider closes 
    crawler.signals.connect(spider_closing, signal=signals.spider_closed) 
    crawler.configure() 
    crawler.crawl(crawler_obj) 
    crawler.start() 

# blocks the process, so always keep this as the last statement 
reactor.run()


Could you improve your code's formatting? What error are you getting? Can you provide a traceback? –

Answers

1

Sorry for not answering the question itself, but just to bring scrapyd and scrapinghub to your attention (at least for quick tests). reactor.run() (when you call it) will run any number of Scrapy instances on a single CPU. Do you want that side effect? Even if you look at scrapyd's code, it doesn't run multiple instances in a single thread; it does fork/spawn subprocesses.
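
To make that concrete, here is a minimal sketch (not scrapyd itself; the spider imports are assumed to match the question's project layout) of running each spider in its own subprocess with Python's multiprocessing, so every crawl gets its own reactor:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

def run_spider(spider_cls):
    # Each child process gets its own CrawlerProcess and Twisted reactor,
    # mirroring the fork/spawn approach scrapyd takes.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()  # blocks the child until the spider finishes

if __name__ == "__main__":
    procs = [Process(target=run_spider, args=(cls,))
             for cls in (DmozSpider, CraigslistSpider)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Because the crawls happen in separate OS processes, they can use multiple CPUs, which a single reactor.run() cannot.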

2

You need something like the code below. You can easily find it in the Scrapy documentation :)

The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands.

# -*- coding: utf-8 -*- 
import sys 
import logging 
import traceback 

from scrapy.crawler import CrawlerProcess 
from scrapy.utils.project import get_project_settings 

from spiders.DmozSpider import DmozSpider 
from spiders.CraigslistSpider import CraigslistSpider 

SPIDER_LIST = [ 
    DmozSpider, CraigslistSpider, 
] 

if __name__ == "__main__": 
    try: 
        # set up one crawler process and register every spider with it; 
        # they all run inside the same Twisted reactor 
        process = CrawlerProcess(get_project_settings()) 
        for spider in SPIDER_LIST: 
            process.crawl(spider) 
        process.start()  # blocks until all spiders have finished 
    except Exception: 
        logging.info('Error on line {}'.format(sys.exc_info()[-1].tb_lineno)) 
        logging.info("Exception: %s" % traceback.format_exc()) 

Reference: http://doc.scrapy.org/en/latest/topics/practices.html


Thank you, but I think this runs on a single processor. I have a list of 100,000 domains, and I want to run 30 instances on AWS EC2. How do I queue the domain list across the 30 instances to run the spiders, so that 30 spiders are running on those 30 instances at once? How can I do that? –


You can make separate scripts for the different instances, with each instance running its own set of your spiders. Sorry, but I don't quite understand your question. – hungneox
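
As a rough illustration of that per-instance split, here is a sketch that shards a domain list across the 30 machines. It assumes each EC2 instance is given its own index through a hypothetical INSTANCE_ID environment variable, and that a hypothetical domains.txt (one domain per line) is present on every instance; neither name comes from the answers above:

import os

NUM_INSTANCES = 30

def domains_for_this_instance(path="domains.txt"):
    # INSTANCE_ID is a hypothetical per-machine value in 0..29,
    # e.g. injected via EC2 user data when the instance is launched
    instance_id = int(os.environ["INSTANCE_ID"])
    with open(path) as f:
        domains = [line.strip() for line in f if line.strip()]
    # round-robin shard: instance k crawls every 30th domain starting at k
    return domains[instance_id::NUM_INSTANCES]

Each instance can then feed its slice to its spiders, for example by passing it as a spider argument via process.crawl(spider, domains=domains_for_this_instance()), since keyword arguments to crawl() are forwarded to the spider's constructor.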