Scrapy on a schedule

Getting Scrapy to run on a schedule is driving me around the Twist(ed).

I thought the test code below would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time:

from quotesbot.spiders.quotes import QuotesSpider 
import schedule 
import time 
from scrapy.crawler import CrawlerProcess 

def run_spider_script(): 
    process.crawl(QuotesSpider) 
    process.start() 


process = CrawlerProcess({ 
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 
}) 


schedule.every(5).seconds.do(run_spider_script) 

while True: 
    schedule.run_pending() 
    time.sleep(1)

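My guess is that CrawlerProcess tries to start the Twisted reactor again on the second run, which a reactor does not allow, and so the program crashes. Is there any way I can control this?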

Also, at this stage, if there is another way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either:

from quotesbot.spiders.quotes import QuotesSpider 
from scrapy import cmdline 
import schedule 
import time 
from scrapy.crawler import CrawlerProcess 


def run_spider_cmd(): 
    print("Running spider") 
    cmdline.execute("scrapy crawl quotes".split()) 


process = CrawlerProcess({ 
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 
}) 


schedule.every(5).seconds.do(run_spider_cmd) 

while True: 
    schedule.run_pending() 
    time.sleep(1)

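Edit

Adding code that uses Twisted's task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way if the goal is a spider that runs at the same time each day?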

from twisted.internet import reactor 
from twisted.internet import task 
from scrapy.crawler import CrawlerRunner 
import scrapy 

class QuotesSpider(scrapy.Spider): 
    name = 'quotes' 
    allowed_domains = ['quotes.toscrape.com'] 
    start_urls = ['http://quotes.toscrape.com/'] 

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()

            print(author, text)


def run_crawl(): 

    runner = CrawlerRunner() 
    runner.crawl(QuotesSpider) 


l = task.LoopingCall(run_crawl) 
l.start(3) 

reactor.run()

2017-05-28 itzafugazi

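Why not simply use cron or systemd timers? – Granitosaurus

The web scraping of data is only one part of the intended application, and I was hoping to run everything as part of a single program. But yes, if I can't get it working in the way described, I'll use the OS task scheduler to run the Scrapy script and run the rest of the application separately. – itzafugazi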

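Two things are worth mentioning first. There is usually only one Twisted reactor running, and it cannot be restarted (as you have discovered). Second, blocking tasks/functions (i.e. time.sleep(n)) should be avoided and replaced with asynchronous alternatives (e.g. 'task.deferLater(reactor, n, ...)').

To use Scrapy effectively from a Twisted project, you need the scrapy.crawler.CrawlerRunner core API rather than scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess runs Twisted's reactor for you (which makes it difficult to restart the reactor), whereas CrawlerRunner relies on the developer to start the reactor. Here is what your code could look like with CrawlerRunner: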

from twisted.internet import reactor 
from quotesbot.spiders.quotes import QuotesSpider 
from scrapy.crawler import CrawlerRunner 

def run_crawl(): 
    """ 
    Run a spider within Twisted. Once it completes, 
    wait 5 seconds and run another spider. 
    """ 
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    deferred = runner.crawl(QuotesSpider)
    # you can use reactor.callLater or task.deferLater to schedule a function;
    # addCallback passes the crawl result as the first argument, so wrap the
    # call in a lambda to keep 5 as the delay
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl() 
reactor.run() # you have to run the reactor yourself

2017-05-28 17:41:05

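Thanks @notorious.no, this has started to clear some things up for me, but unfortunately I haven't been able to get it working on a schedule. I'm probably missing something obvious, but I don't see how I would implement this so that a spider runs at a specific time each day. The closest I can get is Twisted's 'task.LoopingCall()', which I can use to run the spider every 86400 seconds for a daily scrape, but am I going about this the wrong way? I've updated my post with the looping code; any guidance is much appreciated! – itzafugazi

LoopingCall will work fine and is the simplest solution. You could also modify the example code (i.e. the 'reactor.callLater' call scheduled through 'addCallback') and replace the '5' with the number of seconds until the next scrape time. That would give you a bit more precision than 'LoopingCall' –

Thanks @notorious.no. I had misunderstood what 'deferred.addCallback' was doing; a bit of timestamping while debugging and it started to make sense. This will work for me after all, many thanks for your help! – itzafugazi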
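
One possible way to get that number of seconds is sketched below using only the standard library. The helper name seconds_until and its parameters are hypothetical, and it works with naive local time, so it ignores daylight-saving shifts.

```python
from datetime import datetime, timedelta


def seconds_until(hour, minute=0, now=None):
    """Return the seconds from `now` until the next occurrence of hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The time has already passed today, so aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


# Example: at 02:00 the next 03:00 run is one hour away.
delay = seconds_until(3, now=datetime(2017, 5, 28, 2, 0, 0))
```

In the crawl callback, the fixed delay could then become something like `reactor.callLater(seconds_until(3), run_crawl)`, recomputed after every crawl so the schedule does not drift.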
