
I want to crawl a website that has 2 parts, and my script is not as fast as I need it to be. Is it possible to run multiple spiders in parallel for one website in Scrapy?

Is it possible to launch 2 spiders, one for scraping the first part and the other for the second part?

I thought of having 2 different classes and running them with:

scrapy crawl firstSpider 
scrapy crawl secondSpider 

but I don't think this is a smart way to do it.

I read the documentation of scrapyd, but I don't know whether it fits my case.

Answers


I think what you are looking for is something like this:

import scrapy 
from scrapy.crawler import CrawlerProcess 

class MySpider1(scrapy.Spider): 
    # Your first spider definition 
    ... 

class MySpider2(scrapy.Spider): 
    # Your second spider definition 
    ... 

process = CrawlerProcess() 
process.crawl(MySpider1) 
process.crawl(MySpider2) 
process.start() # the script will block here until all crawling jobs are finished 

You can read more about it here: running-multiple-spiders-in-the-same-process
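
For completeness, here is a minimal, self-contained sketch of the same pattern with two spiders, one per section of the site. The spider names, example.com URLs, and the h2::text selector are made up for illustration:

import scrapy 
from scrapy.crawler import CrawlerProcess 

class SectionOneSpider(scrapy.Spider): 
    # Hypothetical spider for the first section of the site 
    name = "section_one" 
    start_urls = ["https://example.com/section-1/"] 

    def parse(self, response): 
        for title in response.css("h2::text").extract(): 
            yield {"section": 1, "title": title} 

class SectionTwoSpider(scrapy.Spider): 
    # Hypothetical spider for the second section of the site 
    name = "section_two" 
    start_urls = ["https://example.com/section-2/"] 

    def parse(self, response): 
        for title in response.css("h2::text").extract(): 
            yield {"section": 2, "title": title} 

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"}) 
process.crawl(SectionOneSpider)  # both crawls are scheduled on the same Twisted reactor 
process.crawl(SectionTwoSpider)  # and therefore run in parallel 
process.start()  # blocks until both spiders have finished 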


Thanks man, this is exactly what I needed – parik


Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings 
from scrapy.crawler import CrawlerProcess 

setting = get_project_settings() 
process = CrawlerProcess(setting) 

for spider_name in process.spiders.list():  # process.spiders is the project's spider loader 
    print("Running spider %s" % spider_name) 
    process.crawl(spider_name, query="dvh")  # query="dvh" is a custom argument passed to each spider 

process.start() 
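
As a side note, keyword arguments given to process.crawl() are passed through to the spider, and the default scrapy.Spider.__init__ stores them as instance attributes, so the spider can read self.query. A minimal sketch with a hypothetical spider name and URL:

import scrapy 

class QuerySpider(scrapy.Spider): 
    # Hypothetical spider; "query" arrives via process.crawl(..., query="dvh") 
    name = "query_spider" 

    def start_requests(self): 
        # the default Spider.__init__ stored query="dvh" as self.query 
        url = "https://example.com/search?q=%s" % self.query 
        yield scrapy.Request(url, callback=self.parse) 

    def parse(self, response): 
        yield {"query": self.query, "url": response.url} 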

It also works, thanks – parik


[twisted] CRITICAL: Unhandled error in Deferred: – zhilevan


A better way, if you have many spiders, is to fetch them dynamically and run them:

from scrapy import spiderloader 
from scrapy.crawler import CrawlerRunner 
from scrapy.utils import project 
from scrapy.utils.log import configure_logging 
from twisted.internet import reactor 
from twisted.internet.defer import inlineCallbacks 

settings = project.get_project_settings() 
configure_logging(settings) 
runner = CrawlerRunner(settings) 


@inlineCallbacks 
def crawl(): 
    spider_loader = spiderloader.SpiderLoader.from_settings(settings) 
    spiders = spider_loader.list() 
    classes = [spider_loader.load(name) for name in spiders] 
    for my_spider in classes: 
        yield runner.crawl(my_spider)  # wait for each crawl to finish before starting the next 
    reactor.stop() 


crawl() 
reactor.run()  # the script blocks here until the last crawl has finished 
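
Note that the yield inside the loop makes the crawls run one after another. If you want them truly in parallel (as asked in the question), a sketch of the same setup using runner.join() to stop the reactor once every crawl has finished:

from scrapy import spiderloader 
from scrapy.crawler import CrawlerRunner 
from scrapy.utils.project import get_project_settings 
from twisted.internet import reactor 

settings = get_project_settings() 
runner = CrawlerRunner(settings) 
spider_loader = spiderloader.SpiderLoader.from_settings(settings) 

for spider_name in spider_loader.list(): 
    runner.crawl(spider_loader.load(spider_name))  # schedule every spider without waiting 

d = runner.join()  # Deferred that fires when all scheduled crawls are done 
d.addBoth(lambda _: reactor.stop()) 
reactor.run()  # blocks until reactor.stop() is called 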

(Second solution): Since spiders.list() is deprecated in Scrapy 1.4, Yuda's solution above should be converted to something like this:

from scrapy import spiderloader 
from scrapy.utils.project import get_project_settings 
from scrapy.crawler import CrawlerProcess 

settings = get_project_settings() 
spider_loader = spiderloader.SpiderLoader.from_settings(settings) 
process = CrawlerProcess(settings) 

for spider_name in spider_loader.list(): 
    print("Running spider %s" % spider_name) 
    process.crawl(spider_name) 
process.start()
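
If I recall correctly, the same 1.4 release renamed the runner's spiders attribute to spider_loader, so the loader does not have to be built by hand; a sketch assuming that attribute:

from scrapy.crawler import CrawlerProcess 
from scrapy.utils.project import get_project_settings 

process = CrawlerProcess(get_project_settings()) 

# spider_loader is the attribute that replaced the deprecated "spiders" 
for spider_name in process.spider_loader.list(): 
    process.crawl(spider_name) 
process.start()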