Scrapy tutorial (noob) - 0 pages crawled

2014-01-26

I have been trying to follow the Scrapy tutorial (from the very beginning), and after running the crawl command at the top level of the project (i.e. the level of scrapy.cfg) I get the following output:

[email protected]:~/scrapy/tutorial$ scrapy crawl dmoz 
/usr/lib/pymodules/python2.7/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask [email protected] for alternatives): 
    BOT_VERSION: no longer used (user agent defaults to Scrapy now) 
    warnings.warn(msg, ScrapyDeprecationWarning) 
2014-01-26 04:17:06-0800 [scrapy] INFO: Scrapy 0.22.0 started (bot: tutorial) 
2014-01-26 04:17:06-0800 [scrapy] INFO: Optional features available: ssl, http11, boto, django 
2014-01-26 04:17:06-0800 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'tutorial.items.TutorialItem', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'USER_AGENT': 'tutorial/1.0', 'BOT_NAME': 'tutorial'} 
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2014-01-26 04:17:06-0800 [scrapy] INFO: Enabled item pipelines: 
2014-01-26 04:17:06-0800 [dmoz] INFO: Spider opened 
2014-01-26 04:17:06-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2014-01-26 04:17:06-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2014-01-26 04:17:06-0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None) 
2014-01-26 04:17:07-0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None) 
2014-01-26 04:17:07-0800 [dmoz] INFO: Closing spider (finished) 
2014-01-26 04:17:07-0800 [dmoz] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 472, 
    'downloader/request_count': 2, 
    'downloader/request_method_count/GET': 2, 
    'downloader/response_bytes': 14888, 
    'downloader/response_count': 2, 
    'downloader/response_status_count/200': 2, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2014, 1, 26, 12, 17, 7, 63261), 
    'log_count/DEBUG': 4, 
    'log_count/INFO': 7, 
    'response_received_count': 2, 
    'scheduler/dequeued': 2, 
    'scheduler/dequeued/memory': 2, 
    'scheduler/enqueued': 2, 
    'scheduler/enqueued/memory': 2, 
    'start_time': datetime.datetime(2014, 1, 26, 12, 17, 6, 567929)} 
2014-01-26 04:17:07-0800 [dmoz] INFO: Spider closed (finished) 
[email protected]:~/scrapy/tutorial$ 

(i.e. 0 pages crawled, at 0 pages/second!)

Troubleshooting so far:

1) Checked the syntax of items.py and dmoz_spider.py (both copy-pasted and typed out by hand)
2) Searched for the problem online, but could not find anyone with the same issue
3) Checked the folder structure etc. to make sure the command is run from the right place
4) Upgraded to the latest version of Scrapy

Any suggestions? My code is exactly as in the example.

dmoz_spider.py is...

from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

and items.py is...

from scrapy.item import Item, Field 

class DmozItem(Item): 
    title = Field() 
    link = Field() 
    desc = Field() 
You may have a typo somewhere; paste your code here –

So... posting the code in a comment did not work for me, so I edited the post @Guy – charliedontsurf

Your spider did crawl 2 pages ('response_received_count': 2), and since you are writing each page's HTML body to a local file, you should have files with the HTML content in your project. But your spider did not scrape any items (the first part of the Scrapy tutorial is not a really useful use case). Continue the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html#using-our-item –

Answers

2

First of all, you should work out what it is you want to crawl.

You passed two start URLs to Scrapy, so it crawled those two pages, but it could not find any more URLs to follow.

None of the book links on those pages match the allowed_domains setting dmoz.org, so they would be filtered out anyway.

You can yield Request(next_url) to crawl more links, where next_url is parsed out of the response.

Or subclass CrawlSpider and specify rules, as in this example.

1

That "Crawled 0 pages (at 0 pages/min)" line is printed periodically, the first time right when the spider is opened. There is no problem with your code; you just have not implemented anything beyond it yet.