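Scrapy tutorial (noob) - 0 pages crawled
I have been trying to follow the Scrapy tutorial (right from the very beginning), and after running the command at the top level of the project (i.e. at the level of scrapy.cfg), I get the following output: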
[email protected]:~/scrapy/tutorial$ scrapy crawl dmoz
/usr/lib/pymodules/python2.7/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask [email protected] for alternatives):
BOT_VERSION: no longer used (user agent defaults to Scrapy now)
warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-26 04:17:06-0800 [scrapy] INFO: Scrapy 0.22.0 started (bot: tutorial)
2014-01-26 04:17:06-0800 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-01-26 04:17:06-0800 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'tutorial.items.TutorialItem', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'}
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled item pipelines:
2014-01-26 04:17:06-0800 [dmoz] INFO: Spider opened
2014-01-26 04:17:06-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-26 04:17:06-0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-26 04:17:07-0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-26 04:17:07-0800 [dmoz] INFO: Closing spider (finished)
2014-01-26 04:17:07-0800 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 472,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 14888,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 1, 26, 12, 17, 7, 63261),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 1, 26, 12, 17, 6, 567929)}
2014-01-26 04:17:07-0800 [dmoz] INFO: Spider closed (finished)
[email protected]:~/scrapy/tutorial$
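(i.e. 0 pages crawled at 0 pages/min!)
Troubleshooting so far:
1) Checked the syntax of items.py and dmoz_spider.py (both copy-pasted and typed in by hand)
2) Searched online for the problem, but could not find anyone else reporting it
3) Checked the folder structure etc. to make sure I am running the command from the right place
4) Upgraded to the latest version of Scrapy
Any suggestions? My code is exactly the same as the example.
dmoz_spider.py is...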
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
and items.py is...
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
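You may have a typo somewhere; paste your code here. –
So... posting code in a comment did not work for me, so I edited the post instead @Guy – charliedontsurf
Your spider did crawl 2 pages ('response_received_count': 2), and since you are writing each page's HTML body to a local file, you should find files with the HTML content in your project directory. What your spider did not do is scrape any items (the first part of the Scrapy tutorial is not a really useful use case). Continue the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html#using-our-item –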
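To check this without re-running the crawl, here is a rough standalone sketch (plain Python, no Scrapy needed). It applies the same filename expression as the spider's parse() to the two start URLs and reads the page and item counts from a subset of the stats in the question's log. The helper name filename_for is made up for illustration, and the "item_scraped_count" key is an assumption about where Scrapy reports scraped items; it does not appear in the log above, so it defaults to 0 here.

```python
# Standalone sketch: plain Python, no Scrapy import required.

start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    """Hypothetical helper: the same expression the spider's parse() uses."""
    return url.split("/")[-2]


# Subset of the stats dict dumped at the end of the log in the question.
stats = {
    "downloader/response_status_count/200": 2,
    "response_received_count": 2,
    "finish_reason": "finished",
}

pages_received = stats.get("response_received_count", 0)
# Assumption: scraped items would be counted under "item_scraped_count";
# the key is absent from the log in the question, so this falls back to 0.
items_scraped = stats.get("item_scraped_count", 0)

filenames = [filename_for(url) for url in start_urls]
print("pages received:", pages_received)
print("items scraped:", items_scraped)
print("files written by parse():", filenames)
```

If the run went the way the log suggests, files with those names should be sitting in the directory where scrapy crawl dmoz was run.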