
I want to run Scrapy from a single script and change its settings. I want to take all the settings from settings.py, but be able to override one or two of them from the script:

from scrapy.crawler import CrawlerProcess 
from scrapy.utils.project import get_project_settings 

process = CrawlerProcess(get_project_settings()) 

### so what I'm missing here is being able to set or override one or two of the settings ###


# 'testspider' is the name of one of the spiders of the project.
process.crawl('testspider', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished 

I wasn't able to use this. I tried the following:

from scrapy.settings import Settings

settings = Settings()
settings.set('RETRY_TIMES', 10)

But it didn't work.
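A likely reason this fails: a standalone Settings() object starts from Scrapy's built-in defaults and is never handed to the CrawlerProcess, so neither the project settings nor the override ever reach the crawl. A minimal sketch of one way around that, using the same get_project_settings helper as above: load settings.py first, then set the override on that same object before creating the process.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load settings.py, then override a single value on the same object
settings = get_project_settings()
settings.set('RETRY_TIMES', 10)

process = CrawlerProcess(settings)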

Note: I'm using the latest version of Scrapy.

Answers


So, to override some of the settings, one approach is to set/override the spider's class attribute custom_settings from our script.

So I import the spider class and then override custom_settings:

from testspiders.spiders.followall import FollowAllSpider

FollowAllSpider.custom_settings = {'RETRY_TIMES': 10}
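
Note that custom_settings is read when the crawler is created, so the assignment has to happen before process.crawl() is called. The same override can also live on the spider class itself, which is where custom_settings normally goes; a minimal sketch, assuming a spider named 'followall':

import scrapy

class FollowAllSpider(scrapy.Spider):
    name = 'followall'
    # spider-level settings, merged over the project settings at crawl time
    custom_settings = {'RETRY_TIMES': 10}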

So here is the whole script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

# override the spider's custom_settings before the crawler is created
FollowAllSpider.custom_settings = {'RETRY_TIMES': 10}

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of the spider patched above
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
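
Since the spider class is already imported, process.crawl() also accepts the class directly instead of the name string, which removes any chance of a mismatch between the name and the patched class; a small variant of the last two lines:

process.crawl(FollowAllSpider, domain='scrapinghub.com')
process.start()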

For some reason the script above didn't work for me. Instead I wrote the following, which does work. Posting it in case anyone runs into the same problem.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# 'cmdline' is the highest built-in settings priority, so this
# override beats both settings.py and any spider custom_settings
process.settings.set('RETRY_TIMES', 10, priority='cmdline')

process.crawl('testspider', domain='scrapinghub.com')
process.start()
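
The priority argument is what makes this stick: every value in Scrapy's settings carries a priority, and 'cmdline' is the highest of the built-in levels, so it wins over settings.py ('project') and a spider's custom_settings ('spider'). A quick way to inspect the levels; the dict shown is from Scrapy 1.x and may differ in newer releases:

from scrapy.settings import SETTINGS_PRIORITIES

# built-in priority levels; higher numbers win
print(SETTINGS_PRIORITIES)
# {'default': 0, 'command': 10, 'project': 20, 'spider': 30, 'cmdline': 40}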