Running Scrapy from Python

I want to run Scrapy from Python. I'm looking at this code (source):

from twisted.internet import reactor 
from scrapy.crawler import Crawler 
from scrapy.settings import Settings 
from scrapy import log 
from testspiders.spiders.followall import FollowAllSpider 

spider = FollowAllSpider(domain='scrapinghub.com') 
crawler = Crawler(Settings()) 
crawler.configure() 
crawler.crawl(spider) 
crawler.start() 
log.start() 
reactor.run() # the script will block here

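My problem is that I'm confused about how to adjust this code to run my own spider. I have named my spider project "spider_a", and the domain to crawl is specified inside the spider itself.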

What I'm asking is: if I normally run my spider with the following command:

scrapy crawl spider_a

how do I adjust the example Python code above to do the same thing?

2013-08-07 Jimmy

Just import your spider class and pass an instance of it to crawler.crawl(), like this:

from testspiders.spiders.spider_a import MySpider 

spider = MySpider() 
crawler.crawl(spider)

2013-08-07 09:58:57 alecxe

Running it this way ignores the user's settings. – Medeiros

In Scrapy 0.19.x (it may also work with older versions), you can do the following:

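from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()  # load the project's own settings
crawler = Crawler(settings)
# stop the reactor once the spider has closed
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # the script will block here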
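For readers on a newer release, here is a minimal sketch of the same idea (project settings plus a blocking run that ends when the spider closes). It assumes Scrapy 1.0 or later, where CrawlerProcess is the documented way to run a spider from a script, and it assumes the script is started from inside the project so that get_project_settings() can find the project. The function name run_spider_by_name is hypothetical and only used for illustration.

```python
def run_spider_by_name(spider_name, **spider_kwargs):
    """Run one spider of the current Scrapy project and block until it finishes.

    Assumes Scrapy 1.0+ and that the working directory is inside the project.
    """
    # Imported lazily so this module can be imported without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Use the project's own settings instead of an empty Settings() object.
    process = CrawlerProcess(get_project_settings())
    # A spider name (the same one passed to "scrapy crawl") is resolved
    # through the project's spider loader; keyword arguments go to the spider.
    process.crawl(spider_name, **spider_kwargs)
    # Starts the Twisted reactor and blocks until the crawl has finished.
    process.start()
```

Under these assumptions, calling run_spider_by_name("spider_a") would correspond to running scrapy crawl spider_a. Because the Twisted reactor cannot be restarted, this should be called only once per process.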

You can even invoke the command directly from a script, like this:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  # followall is the spider's name
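If the command needs spider arguments or setting overrides, splitting a string by hand gets fragile. The sketch below builds the argument list explicitly instead. The helper names build_crawl_argv and run_crawl are hypothetical; the -a and -s options are the ones the scrapy crawl command accepts for spider arguments and settings.

```python
def build_crawl_argv(spider_name, spider_args=None, settings=None):
    """Return the argv list for "scrapy crawl <spider_name>".

    spider_args become "-a key=value" pairs, settings become "-s KEY=value".
    """
    argv = ["scrapy", "crawl", spider_name]
    for key, value in (spider_args or {}).items():
        argv += ["-a", f"{key}={value}"]
    for key, value in (settings or {}).items():
        argv += ["-s", f"{key}={value}"]
    return argv


def run_crawl(spider_name, spider_args=None, settings=None):
    """Run the crawl command in this process, as the command line would."""
    # Imported lazily so the helper above can be used without Scrapy installed.
    from scrapy import cmdline

    # cmdline.execute() exits the interpreter when the command finishes,
    # so no code after this call will run.
    cmdline.execute(build_crawl_argv(spider_name, spider_args, settings))
```

For example, build_crawl_argv("followall", spider_args={"domain": "scrapinghub.com"}) yields the same list as "scrapy crawl followall -a domain=scrapinghub.com".split().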

Take a look at my answer here. I changed the official documentation so that your crawler now uses your settings and can produce output.

2013-09-27 22:49:35 Medeiros
