2015-02-10

I'm following this tutorial on scraping the iTunes charts with Scrapy: http://davidwalsh.name/python-scrape

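The tutorial is slightly dated: it uses some syntax that is deprecated in the current version of Scrapy (HtmlXPathSelector, BaseSpider, ...). I've been trying to complete the tutorial with the current version of Scrapy, but without success.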

If anyone can see what I'm doing wrong, I'd love to understand what I need to fix.

items.py

from scrapy.item import Item, Field 

class AppItem(Item): 
    app_name = Field() 
    category = Field() 
    appstore_link = Field() 
    img_src = Field() 

apple_spider.py

import scrapy 
from scrapy.selector import Selector 

from apple.items import AppItem 

class AppleSpider(scrapy.Spider): 
    name = "apple" 
    allowed_domains = ["apple.com"] 
    start_urls = ["http://www.apple.com/itunes/charts/free-apps/"] 

    def parse(self, response): 
     apps = response.selector.xpath('//*[@id="main"]/section/ul/li') 
     count = 0 
     items = [] 

     for app in apps: 

      item = AppItem() 
      item['app_name'] = app.select('//h3/a/text()')[count].extract() 
      item['appstore_link'] = app.select('//h3/a/@href')[count].extract() 
      item['category'] = app.select('//h4/a/text()')[count].extract() 
      item['img_src'] = app.select('//a/img/@src')[count].extract() 

      items.append(item) 
      count += 1 

     return items 

This is my console output after running scrapy crawl apple:

2015-02-10 13:38:12-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: apple)
2015-02-10 13:38:12-0500 [scrapy] INFO: Optional features available: ssl, http11, django
2015-02-10 13:38:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'apple.spiders', 'SPIDER_MODULES': ['apple.spiders'], 'BOT_NAME': 'apple'}
2015-02-10 13:38:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled item pipelines:
2015-02-10 13:38:13-0500 [apple] INFO: Spider opened
2015-02-10 13:38:13-0500 [apple] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-10 13:38:13-0500 [apple] DEBUG: Crawled (200) <GET http://www.apple.com/itunes/charts/free-apps/> (referer: None)
2015-02-10 13:38:13-0500 [apple] INFO: Closing spider (finished) 
2015-02-10 13:38:13-0500 [apple] INFO: Dumping Scrapy stats: 
     {'downloader/request_bytes': 236, 
     'downloader/request_count': 1, 
     'downloader/request_method_count/GET': 1, 
     'downloader/response_bytes': 13148, 
     'downloader/response_count': 1, 
     'downloader/response_status_count/200': 1, 
     'finish_reason': 'finished', 
     'finish_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 271000), 
     'log_count/DEBUG': 3, 
     'log_count/INFO': 7, 
     'response_received_count': 1, 
     'scheduler/dequeued': 1, 
     'scheduler/dequeued/memory': 1, 
     'scheduler/enqueued': 1, 
     'scheduler/enqueued/memory': 1, 
     'start_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 240000)} 
2015-02-10 13:38:13-0500 [apple] INFO: Spider closed (finished) 

Thanks in advance for any help or advice!

Answer

Before reading the technical part: make sure you are not violating the iTunes terms of use.

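All of your problems are inside the parse() callback:

  • The main XPath is incorrect (there is no ul element directly under the section)
  • Instead of response.selector, you can use response directly
  • The XPath expressions inside the loop should be context-specific, that is, relative to the current element (starting with a dot)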

Fixed version:

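def parse(self, response):
    # "//ul" matches the list at any depth below the section, not only as a direct child
    apps = response.xpath('//*[@id="main"]/section//ul/li')

    for app in apps:
        item = AppItem()
        # The leading "." keeps each query relative to the current <li>;
        # extract() on the result returns a list of strings
        item['app_name'] = app.xpath('.//h3/a/text()').extract()
        item['appstore_link'] = app.xpath('.//h3/a/@href').extract()
        item['category'] = app.xpath('.//h4/a/text()').extract()
        item['img_src'] = app.xpath('.//a/img/@src').extract()

        yield item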
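The first correction changes `/section/ul/li` to `/section//ul/li`. As a minimal sketch of why that matters, the snippet below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors (it supports only a subset of XPath, but the `/` versus `//` distinction is the same). The markup is a hypothetical simplification, not the real iTunes page: it only assumes that the `ul` is wrapped in some intermediate element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the chart page: the <ul> is wrapped
# in a <div>, so it is a descendant of <section> but not a direct child.
PAGE = """
<div id="main">
  <section>
    <div class="chart">
      <ul>
        <li><h3><a href="/app/one">App One</a></h3></li>
        <li><h3><a href="/app/two">App Two</a></h3></li>
      </ul>
    </div>
  </section>
</div>
"""

root = ET.fromstring(PAGE)  # root is the <div id="main"> element

# "section/ul/li" follows direct children only, so it finds nothing here.
direct_children = root.findall("./section/ul/li")

# "section//ul/li" looks for <ul> at every level below <section>.
descendants = root.findall("./section//ul/li")

print(len(direct_children), len(descendants))
```

With an empty node list there is nothing to loop over, which would be consistent with a crawl that fetches the page with status 200 but scrapes no items.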

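Thanks alecxe! The corrections make sense, but could you briefly explain why the 'count' variable was dropped in the revised code? How does this all work without a loop counter (or, better put, what role was the loop counter playing)? – ploo 2015-02-10 19:06:30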


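@ploo Sure. The key point is that 'response' is basically a selector, and 'app' is also a selector, for an 'li' element. We don't need a loop index or a counter variable simply because the 'app' variable on each iteration is already a specific 'li' element that we can search within using 'xpath()'. I'm terrible at explaining things, but I hope this makes sense. – alecxe 2015-02-10 19:09:04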
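The point about the counter can be sketched with a small, self-contained example. It again uses `xml.etree.ElementTree` from the standard library as a stand-in for Scrapy's selectors, and the markup and field values are hypothetical. In XPath, an expression that starts with `//` searches the whole document no matter which element it is called on, which is why the original code had to index the results with `count`. A relative expression (starting with `.`) searches only inside the current `li`, so each iteration already sees its own data.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing with two entries, loosely shaped like the chart markup.
LISTING = """
<ul>
  <li><h3><a href="/app/one">App One</a></h3><h4><a>Games</a></h4></li>
  <li><h3><a href="/app/two">App Two</a></h3><h4><a>Music</a></h4></li>
</ul>
"""

root = ET.fromstring(LISTING)

# A document-wide search returns every match, so picking the right one
# would need an index that is kept in step with the loop.
all_names = [a.text for a in root.findall(".//h3/a")]

# Searching relative to each <li> needs no index: the loop variable is
# already the element whose children we want.
rows = []
for li in root.findall("./li"):
    rows.append({
        "app_name": li.findtext(".//h3/a"),
        "category": li.findtext(".//h4/a"),
        "appstore_link": li.find(".//h3/a").get("href"),
    })

print(all_names)
print(rows)
```

The per-element version also stays correct if one entry is missing a field, whereas a shared index into a document-wide result list would silently shift every later value.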


I think I get it. Thanks again for the prompt response, very helpful! Cheers – ploo 2015-02-10 19:12:07