
Running multiple spiders in a for loop: I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me the error ReactorNotRestartable.

feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }
}

With the feed definitions above, I try to run both spiders in a loop, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    name = None

    def __init__(self, **kwargs):
        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass


def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })

    for feed_name in feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()  # starting the process inside the loop fails on the second pass

The exception on the second iteration looks like this; the spider opens, but then:

... 
2015-11-22 00:00:00 [scrapy] INFO: Spider opened 
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-feeds-crawler/feed_crawler/crawl.py", line 51, in start_crawler
    process.start() # the script will block here until the crawling is finished
  File "/Users/bling/py-feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False) # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Do I have to somehow invalidate the first MySpider, or am I doing something wrong and need to change how this works? Thanks in advance.

Answers


The solution was simply to collect the spiders in the loop and start the process once, at the end. My guess is that this has to do with reactor allocation/deallocation: CrawlerProcess.start() runs the Twisted reactor, which cannot be restarted once it has stopped, so it must be called exactly once.

def start_crawler():

    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # disable for issues with boto
    })

    for feed_name in CONFIG['Feeds'].keys():
        MySpider.name = feed_name
        process.crawl(MySpider)

    process.start()

Thanks @eLRuLL for your answer; it inspired me to find this solution.
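
For reference, an alternative sketch (not from this thread; it follows the pattern in the Scrapy documentation for running multiple spiders in the same process, and assumes the MySpider, feeds and CONFIG objects defined above): use CrawlerRunner and manage the reactor yourself, starting it exactly once.

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def start_crawler():
    configure_logging()
    runner = CrawlerRunner({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })

    @defer.inlineCallbacks
    def crawl_all():
        # run the spiders sequentially; each crawl() returns a Deferred
        for feed_name in feeds.keys():
            MySpider.name = feed_name
            yield runner.crawl(MySpider)
        reactor.stop()

    crawl_all()
    reactor.run()  # blocks until all crawls finish; the reactor starts once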


It looks like you have to instantiate one process per spider. Try:

def start_crawler():
    for feed_name in feeds.keys():
        process = CrawlerProcess({
            'USER_AGENT': CONFIG['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
        })
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()
Makes more sense indeed, but it still raises the exception. – rebeling
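
The deeper issue, not stated in this answer: process.start() starts the Twisted reactor, and Twisted's reactor can only be started once per Python process, so even a fresh CrawlerProcess hits ReactorNotRestartable. A workaround sketch (my own, assuming the MySpider, feeds and CONFIG objects from the question) is to run each crawl in its own OS process:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess

def run_spider(feed_name):
    # runs in a child process, so each crawl gets a fresh reactor
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    MySpider.name = feed_name
    process.crawl(MySpider)
    process.start()

def start_crawler():
    for feed_name in feeds.keys():
        p = Process(target=run_spider, args=(feed_name,))
        p.start()
        p.join()  # wait, so the crawls run one after another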


You can pass arguments in the crawl call and use them when the spider is created:

class MySpider(XMLFeedSpider):
    def __init__(self, name, **kwargs):
        self.name = name  # set before Spider.__init__, which requires a name
        super(MySpider, self).__init__(**kwargs)


def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })

    for feed_name in feeds.keys():
        process.crawl(MySpider, feed_name)

    process.start()
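
A hedged completion (my own sketch, not from the answer) tying the passed name back to the question's feeds dict, so each spider also picks up its start_urls:

class MySpider(XMLFeedSpider):
    iterator = 'iternodes'
    itertag = 'items'  # as in the question; note RSS feeds typically use 'item'

    def __init__(self, name, **kwargs):
        self.name = name  # must be set before Spider.__init__
        self.start_urls = feeds[name].get('start_urls')
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass  # extract items here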