2014-01-16 118 views

My scraper runs fine for about an hour. After a while, I start seeing these errors: Scrapy: Unhandled Error

2014-01-16 21:26:06+0100 [-] Unhandled Error
    Traceback (most recent call last):
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/crawler.py", line 93, in start
        self.start_reactor()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/crawler.py", line 130, in start_reactor
        reactor.run(installSignalHandlers=False) # blocking call
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 1192, in run
        self.mainLoop()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/core/engine.py", line 106, in _next_request
        if not self._next_request_from_scheduler(spider):
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/core/engine.py", line 132, in _next_request_from_scheduler
        request = slot.scheduler.next_request()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/core/scheduler.py", line 64, in next_request
        request = self._dqpop()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/core/scheduler.py", line 94, in _dqpop
        d = self.dqs.pop()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/queuelib/pqueue.py", line 43, in pop
        m = q.pop()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/squeue.py", line 18, in pop
        s = super(SerializableQueue, self).pop()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/queuelib/queue.py", line 157, in pop
        self.f.seek(-size-self.SIZE_SIZE, os.SEEK_END)
    exceptions.IOError: [Errno 22] Invalid argument

What could be causing this? My version is 0.20.2. Once I get this error, Scrapy stops doing anything at all. Even if I stop the crawl and run it again (with the same JOBDIR), it keeps giving me these errors. I have to delete the job directory and start from scratch to get rid of them.
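The bottom frame of the traceback shows queuelib seeking backwards from the end of its on-disk queue file. If that file has been truncated or corrupted (for example, by two crawls writing to the same JOBDIR), the computed offset can land before the start of the file, and `seek()` then fails with exactly this `[Errno 22] Invalid argument`. A minimal sketch of that failure mode (the record sizes here are hypothetical, not queuelib's actual format):

```python
import os
import tempfile

# A disk-backed queue stores each record with a fixed-size length footer;
# pop() seeks backwards from the end by (record size + footer size). On a
# truncated file that target offset is negative, and seek() raises EINVAL.
path = tempfile.mktemp()
with open(path, "wb") as f:
    f.write(b"abc")  # simulate a truncated queue file: only 3 bytes on disk

errno_seen = None
with open(path, "rb") as f:
    size, SIZE_SIZE = 100, 4  # hypothetical record length read from a footer
    try:
        f.seek(-size - SIZE_SIZE, os.SEEK_END)  # absolute offset would be -101
    except (IOError, OSError) as exc:
        errno_seen = exc.errno

os.remove(path)
print(errno_seen)  # 22 (EINVAL): Invalid argument
```

This is why deleting the job directory makes the error go away: it throws out the damaged queue file along with everything else.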


Did you start multiple crawls using the same JOBDIR? – Rolando


@Rolando I may have! Do you think there is a way for me to recover from this bad state? –


@AlexanderSuraphel You can try running with `--pdb` to debug the problem. Then you can clear the `JOBDIR` state as Marcelo suggested. – Rolando

Answer


Try this:

  • Make sure you are running the latest version of Scrapy (current: 0.24)
  • Look inside the JOBDIR folder and back up the file requests.seen
  • After backing it up, delete the Scrapy job folder
  • Start the crawl again with the JOBDIR= option to resume it
  • Stop the crawl
  • Replace the newly created requests.seen with the earlier backup
  • Start the crawl again

Actually, you don't need to restart the crawler at all; just delete the files other than requests.seen. –