2013-12-20

Scrapy (or Selenium) freezes after being redirected to a different site

I am running a Scrapy CrawlSpider with Selenium and I am facing some strange problems. The spider crawls for a while and then freezes — it appears to be doing nothing, or to be stuck. I keep running into this, and to force-stop the spider I have to kill the PhantomJS driver. The spider works beautifully on external websites, but every time I try it on my custom localhost site, it freezes. Here is the error log:

scrapy crawl image -o test.csv -t csv 
2013-12-19 18:12:43-0700 [scrapy] INFO: Scrapy 0.20.2 started (bot: cultr) 
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Optional features available: ssl, http11 
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE':   
'cultr.spiders', 'FEED_URI': 'test.csv', 'SPIDER_MODULES': ['cultr.spiders'], 'BOT_NAME':  
'cultr', 'USER_AGENT': 'cultr (+http://cultr.business.ualberta.ca)', 'FEED_FORMAT': 'csv'} 
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, 
TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled downloader middlewares: 
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, 
DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, 
RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, 
OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 

2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-12-19 18:12:43-0700 [image] INFO: Spider opened 
2013-12-19 18:12:43-0700 [image] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items  
(at 0 items/min) 
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2013-12-19 18:12:43-0700 [image] DEBUG: Crawled (200) <GET http://lh:8000/> (referer: None)

2013-12-19 18:12:43-0700 [image] DEBUG: Visiting start of site: http://lh:8000/ 
2013-12-19 18:12:43-0700 [image] DEBUG: Parsing images for: http://lh:8000/ 
2013-12-19 18:12:44-0700 [image] DEBUG: Scraped from <200 http://lh:8000/> 
{'AreaList': [36864], 
'CSSImagesList': [], 
'ImageIDList': [u':wdc:1387501964546'], 
'ImagesFileNames': [u'homepage-bcorp.png'], 
'ImagesList': [], 
'PositionList': [{'x': 8, 'y': 309}], 
'SiteUrl': u'http://localhosts:8000/', 
'WidthHeightList': [{'height': 192, 'width': 192}], 
'depth': 1, 
'domain': 'http://localhosts:8000', 
'htmlImagesList': [], 
'status': 'ok', 
'totalAreaOfImages': 36864, 
'totalNumberOfImages': 1} 

2013-12-19 18:13:33-0700 [image] ERROR: Spider error processing <GET http://<domain>:8000/pages/forbidden.html>
	Traceback (most recent call last):
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 602, in _tick
	    taskObj._oneWorkUnit()
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 479, in _oneWorkUnit
	    result = self._iterator.next()
	  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
	    work = (callable(elem, *args, **named) for elem in iterable)
	--- <exception caught here> ---
	  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
	    yield next(it)
	  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
	    for x in result:
	  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
	    return (_set_referer(r) for r in result or ())
	  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
	    return (r for r in result or () if _filter(r))
	  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
	    return (r for r in result or () if _filter(r))
	  File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 67, in _parse_response
	    cb_res = callback(response, **cb_kwargs) or ()
	  File "/Users/eddieantonio/Work/cultr/spider/cultr/spiders/ImageSpider.py", line 164, in parse_images
	    driver.get(response.url)
	  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 176, in get
	    self.execute(Command.GET, {'url': url})
	  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
	    response = self.command_executor.execute(driver_command, params)
	  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
	    return self._request(url, method=command_info[0], data=data)
	  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 410, in _request
	    resp = opener.open(request)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
	    response = self._open(req, data)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
	    '_open', req)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
	    result = func(*args)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
	    return self.do_open(httplib.HTTPConnection, req)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1187, in do_open
	    r = h.getresponse(buffering=True)
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
	    response.begin()
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
	    version, status, reason = self._read_status()
	  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 373, in _read_status
	    raise BadStatusLine(line)
	httplib.BadStatusLine: ''

Answer


httplib.BadStatusLine means:

    Raised if a server responds with an HTTP status line we don't understand.
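In practice, the empty string in `httplib.BadStatusLine: ''` means the server closed the connection without sending any status line at all. A minimal sketch of that condition (written in Python 3, where `httplib` is named `http.client`; the throwaway local server here is only an illustration, not part of the original setup):

```python
import socket
import threading
import http.client  # Python 3 name for httplib

# A throwaway server that accepts one connection, reads the request,
# and closes the socket without sending any status line back.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def close_without_reply():
    conn, _ = server.accept()
    conn.recv(65536)   # consume the request
    conn.close()       # reply with nothing -> no status line

threading.Thread(target=close_without_reply, daemon=True).start()

err = None
conn = http.client.HTTPConnection("127.0.0.1", port)
try:
    conn.request("GET", "/")
    conn.getresponse()
except http.client.BadStatusLine as exc:  # what the spider's log shows
    err = exc
print("caught:", type(err).__name__)
```

This is the same failure mode as in the traceback above: `urllib2`/`httplib` try to parse a status line that never arrives.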

I think your custom site is returning a malformed (or empty) response when you crawl it. You should use the Scrapy shell to request http://localhosts:8000/pages/forbidden.html and inspect what actually comes back.
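For example (assuming Scrapy is installed and your site is being served locally; the URL is the one from the error log):

```shell
scrapy shell 'http://localhosts:8000/pages/forbidden.html'
# Then, inside the interactive shell, inspect the reply:
#   >>> response.status
#   >>> response.headers
#   >>> response.body
```

If the shell itself hangs or errors out, the problem is the server's response rather than your spider code.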