Port error in Scrapy

I have designed a crawler that contains two spiders, both built with Scrapy. The spiders run independently, pulling the data they need from a database.

We run these spiders through the Twisted reactor, and as we know the reactor cannot be restarted once it has stopped. We gave the second spider around 500 links to crawl, and when we do this we hit a port error, i.e. Scrapy appears to be limited to a single port:

Error caught on signal handler: <bound method ?.start_listening of <scrapy.telnet.TelnetConsole instance at 0x0467B440>>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
    result = g.send(result)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\core\engine.py", line 75, in start
    yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\signalmanager.py", line 23, in send_catch_log_deferred
    return signal.send_catch_log_deferred(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\utils\signal.py", line 53, in send_catch_log_deferred
    *arguments, **named)
  --- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 137, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\xlib\pydispatch\robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\telnet.py", line 47, in start_listening
    self.port = listen_tcp(self.portrange, self.host, self)
  File "C:\Python27\lib\site-packages\scrapy-0.16.5-py2.7.egg\scrapy\utils\reactor.py", line 14, in listen_tcp
    return reactor.listenTCP(x, factory, interface=host)
  File "C:\Python27\lib\site-packages\twisted\internet\posixbase.py", line 489, in listenTCP
    p.startListening()
  File "C:\Python27\lib\site-packages\twisted\internet\tcp.py", line 980, in startListening
    raise CannotListenError(self.interface, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on 0.0.0.0:6073: [Errno 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted.

So what is going wrong here, and what is the best way to handle this situation? Please help...

P.S.: I have already increased the number of ports, but it always takes 6073 as the default.
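
For context, a sketch of the settings that govern these ports, assuming Scrapy 0.16 as shown in the traceback: 6073 is the upper bound of the default telnet-console port range (TELNETCONSOLE_PORT = [6023, 6073]), and each running crawler binds the first free port in that range, so the error above means the whole range is already in use. Disabling or widening the ranges in settings.py looks roughly like this (the widened values are illustrative, not recommendations):

# settings.py -- TELNETCONSOLE_* and WEBSERVICE_* are real Scrapy 0.16
# settings; the widened ranges below are illustrative values only
TELNETCONSOLE_ENABLED = False        # simplest: bind no telnet port at all
# TELNETCONSOLE_PORT = [6023, 6999]  # ...or give concurrent crawlers more room
WEBSERVICE_ENABLED = False           # the web service binds ports the same way
# WEBSERVICE_PORT = [6080, 7999]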

Can you show how you run your spiders and how you configure them? – alecxe

This is a duplicate of http://stackoverflow.com/questions/1767553/twisted-errors-in-scrapy-spider – Jean-Paul Calderone

@Jean-PaulCalderone No, it is not the same: I have already disabled the web and telnet consoles, but it shows the same error. – sathish

Answer

Your problem can be solved by running fewer concurrent crawlers. Below is a recipe I wrote for issuing requests sequentially. This particular class runs only one crawler at a time, but the changes needed to make it run in batches (say, 10 at a time) are trivial:

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor


class SequentialCrawlManager(object):
    """Start spiders sequentially, one per website, on a single reactor."""

    def __init__(self, spider, websites):
        self.spider = spider
        self.websites = websites
        # load the project settings once and reuse them for every crawler
        self.settings = get_project_settings()
        self.current_site_idx = 0

    def next_site(self):
        if self.current_site_idx < len(self.websites):
            self.crawler = Crawler(self.settings)
            # call next_site again when this spider finishes, so the
            # spiders run one after another instead of concurrently
            self.crawler.signals.connect(self.next_site,
                                         signal=signals.spider_closed)
            self.crawler.configure()
            # pass per-site arguments (e.g. the current website) to the
            # spider here if it needs them
            spider = self.spider()
            self.crawler.crawl(spider)
            self.crawler.start()
            self.current_site_idx += 1
        else:
            reactor.stop()  # required for the program to terminate

    def start(self):
        log.start()
        self.next_site()
        reactor.run()  # blocking call; returns after reactor.stop()
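
A hypothetical usage sketch (MySpider, myproject, and the URL list are placeholders, not part of the original answer):

# hypothetical driver script -- MySpider and the URLs are placeholders
from myproject.spiders import MySpider

websites = ['http://example.com', 'http://example.org']
manager = SequentialCrawlManager(MySpider, websites)
manager.start()  # blocks until every site has been crawled

Because each Crawler is created only after the previous spider's spider_closed signal fires, at most one telnet/web-console port is bound at any moment, which avoids the CannotListenError above.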