
Scrapy - run a spider multiple times

I have set up my crawler this way:

import json

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawler(mood):
    process = CrawlerProcess(get_project_settings())
    # crawl music selected by critics on the web
    process.crawl('allmusic_{}_tracks'.format(mood), domain='allmusic.com')
    # the script will block here until the crawling is finished
    process.start()
    # create containers for scraped data
    allmusic = []
    allmusic_tracks = []
    allmusic_artists = []
    # process the pipelined files
    with open('blogs/spiders/allmusic_data/{}_tracks.jl'.format(mood), 'r') as t:
        for line in t:
            allmusic.append(json.loads(line))
    # fetch artists and their corresponding tracks
    for item in allmusic:
        allmusic_artists.append(item['artist'])
        allmusic_tracks.append(item['track'])
    return (allmusic_artists, allmusic_tracks)

I can run it like this:

artist_list, song_list = crawler('bitter')
print(artist_list)

and it works fine.

But if I want to run it several times in a row:

artist_list, song_list = crawler('bitter') 
artist_list2, song_list2 = crawler('harsh') 

I get:

twisted.internet.error.ReactorNotRestartable

Is there a simple way to build a wrapper around this kind of spider so I can run it multiple times?

Answer


It's quite simple.

The CrawlerProcess is defined inside the function, and Twisted's reactor can only be started once per Python process, so calling crawler() a second time tries to restart a reactor that has already run. The fix is to schedule every crawl on one process before starting it.

That way, I can do this:

def crawler(mood1, mood2):
    process = CrawlerProcess(get_project_settings())
    # crawl music selected by critics on the web
    process.crawl('allmusic_{}_tracks'.format(mood1), domain='allmusic.com')
    process.crawl('allmusic_{}_tracks'.format(mood2), domain='allmusic.com')
    # the script will block here until the crawling is finished
    process.start()

The caveat is that you must have a spider class already defined for each crawl.
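
For more than two moods, the same idea generalizes to a loop. The sketch below is only an illustration of that pattern: the spider-name format and the .jl output paths are copied from the question's code, while the crawl_moods helper and the returned dict are hypothetical names of mine. It schedules every spider on a single CrawlerProcess, starts the reactor once, and then reads the pipelined files back, as the original crawler() did:

import json

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl_moods(moods):
    # hypothetical helper: one CrawlerProcess, one reactor start,
    # any number of spiders scheduled up front
    process = CrawlerProcess(get_project_settings())
    for mood in moods:
        # assumes a spider named 'allmusic_<mood>_tracks' is defined for each mood
        process.crawl('allmusic_{}_tracks'.format(mood), domain='allmusic.com')
    # blocks here until every scheduled crawl has finished
    process.start()
    # read the pipelined .jl files back, as in the original crawler()
    results = {}
    for mood in moods:
        with open('blogs/spiders/allmusic_data/{}_tracks.jl'.format(mood)) as t:
            items = [json.loads(line) for line in t]
        results[mood] = ([i['artist'] for i in items], [i['track'] for i in items])
    return results

Used like this, it gives the same per-mood lists as the original function without ever restarting the reactor:

results = crawl_moods(['bitter', 'harsh'])
artist_list, song_list = results['bitter']
artist_list2, song_list2 = results['harsh']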