2013-02-25 220 views
2

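Scrapy - Crawl and Scrape a Website

As part of learning to use Scrapy, I tried to crawl Amazon, and I ran into a problem while scraping the data.

The output of my code is as follows: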

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155> 
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13', 
       u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate/dp/0062065351/ref=sr_1_14?s=books&ie=UTF8&qid=1361774694&sr=1-14', 
       u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15', 
       u'http://www.amazon.com/Inferno-Robert-Langdon-Dan-Brown/dp/0385537859/ref=sr_1_16?s=books&ie=UTF8&qid=1361774694&sr=1-16', 
       u'http://www.amazon.com/Memory-Light-Wheel-Time/dp/0765325950/ref=sr_1_17?s=books&ie=UTF8&qid=1361774694&sr=1-17', 
       u'http://www.amazon.com/Jesus-Calling-Enjoying-Peace-Presence/dp/1591451884/ref=sr_1_18?s=books&ie=UTF8&qid=1361774694&sr=1-18', 
       u'http://www.amazon.com/Fifty-Shades-Grey-Book-Trilogy/dp/0345803485/ref=sr_1_19?s=books&ie=UTF8&qid=1361774694&sr=1-19', 
       u'http://www.amazon.com/Fifty-Shades-Trilogy-Darker-3-/dp/034580404X/ref=sr_1_20?s=books&ie=UTF8&qid=1361774694&sr=1-20', 
       u'http://www.amazon.com/Wheat-Belly-Lose-Weight-Health/dp/1609611543/ref=sr_1_21?s=books&ie=UTF8&qid=1361774694&sr=1-21', 
       u'http://www.amazon.com/Publication-Manual-American-Psychological-Association/dp/1433805618/ref=sr_1_22?s=books&ie=UTF8&qid=1361774694&sr=1-22', 
       u'http://www.amazon.com/One-Only-Ivan-Katherine-Applegate/dp/0061992259/ref=sr_1_23?s=books&ie=UTF8&qid=1361774694&sr=1-23', 
       u'http://www.amazon.com/Inquebrantable-Spanish-Jenni-Rivera/dp/1476745420/ref=sr_1_24?s=books&ie=UTF8&qid=1361774694&sr=1-24'], 
    'title': [u'ObamaCare Survival Guide', 
       u'The Official SAT Study Guide, 2nd edition', 
       u'Inferno: A Novel (Robert Langdon)', 
       u'A Memory of Light (Wheel of Time)', 
       u'Jesus Calling: Enjoying Peace in His Presence', 
       u'Fifty Shades of Grey: Book One of the Fifty Shades Trilogy', 
       u'Fifty Shades Trilogy: Fifty Shades of Grey, Fifty Shades Darker, Fifty Shades Freed 3-volume Boxed Set', 
       u'Wheat Belly: Lose the Wheat, Lose the Weight, and Find Your Path Back to Health', 
       u'Publication Manual of the American Psychological Association, 6th Edition', 
       u'The One and Only Ivan', 
       u'Inquebrantable (Spanish Edition)'], 
    'visit_id': '2f4d045a9d6013ef4a7cbc6ed62dc111f6111633', 
    'visit_status': 'new'} 

However, I want the output to be captured like this, with one item per book:

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155> 
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13'], 
    'title': [u'ObamaCare Survival Guide']} 

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155> 
    {'link': [u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15'], 
    'title': [u'The Official SAT Study Guide, 2nd edition']} 

I think the problem is not with Scrapy or the crawler itself, but with how I wrote the for loop.

The code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 
from Amaze.items import AmazeItem 

class AmazeSpider2(CrawlSpider):
    name = "scanon"
    allowed_domains = ["www.amazon.com"]
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", follow=True),
    )

    def parse_items_1(self, response):
        items = []
        print ('*** response:', response.url)
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h3')
        for title in titles:
            item = AmazeItem()
            item["title"] = title.select('//a[@class="title"]/text()').extract()
            item["link"] = title.select('//a[@class="title"]/@href').extract()
            print ('**parse-items_1:', item["title"], item["link"])
            items.append(item)
        return items

Any help would be appreciated!

Answers

3

The problem is in your XPath:

def parse_items_1(self, response):
    items = []
    print ('*** response:', response.url)
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')
    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()
        print ('**parse-items_1:', item["title"], item["link"])
        items.append(item)
    return items
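
In the XPath expressions above, you need a leading "." so that each expression searches inside the current title node only. Otherwise the XPath is evaluated against the whole page, so on every iteration it matches every title link on the page and returns all of them.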
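
To make the difference concrete, here is a minimal sketch of a page-wide search versus a search scoped to each h3 node. It makes several assumptions: it uses the standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors, the PAGE markup is a made-up, simplified page (not Amazon's real HTML), and the names page_wide_titles and items are only for illustration. ElementTree does not accept an absolute path such as //a on an element, so the page-wide case is emulated by searching from the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified results page (an assumption, not Amazon's markup).
PAGE = """
<html><body>
  <h3><a class="title" href="/dp/1">Book One</a></h3>
  <h3><a class="title" href="/dp/2">Book Two</a></h3>
  <h3><a class="title" href="/dp/3">Book Three</a></h3>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
headings = root.findall(".//h3")

# Page-wide search: roughly what '//a[@class="title"]' does in the question.
# It ignores the current <h3>, so every loop iteration sees all three titles.
page_wide_titles = [a.text for a in root.findall(".//a[@class='title']")]

# Scoped search: roughly what './/a[@class="title"]' does in the answer.
# Each <h3> contributes only the link it contains, giving one item per book.
items = []
for h3 in headings:
    anchors = h3.findall(".//a[@class='title']")
    items.append({
        "title": [a.text for a in anchors],
        "link": [a.get("href") for a in anchors],
    })

print(page_wide_titles)
for item in items:
    print(item)
```

With the scoped search, each item holds a single title and a single link, which is the shape the question asks for.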


It worked! Thank you. – Srikanth 2013-02-25 07:51:24

0

Use yield to turn the callback into a generator, and fix your XPath selectors:

def parse_items_1(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')

    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()

        yield item
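
As a rough illustration of what yield changes, the sketch below compares a callback-style function that builds a list with one that yields items one at a time. It is plain Python with no Scrapy dependency; the function names, the rows data, and the use of dicts in place of AmazeItem are all assumptions made for the example. Both versions produce the same items, so yield by itself does not fix the output in the question; the relative XPath does that. The generator simply avoids building the whole list up front.

```python
def parse_as_list(rows):
    """Collect every item in a list and return it at the end."""
    items = []
    for title, link in rows:
        items.append({"title": [title], "link": [link]})
    return items


def parse_as_generator(rows):
    """Yield each item as soon as it is built (lazy, no intermediate list)."""
    for title, link in rows:
        yield {"title": [title], "link": [link]}


# Hypothetical rows standing in for the (title, link) pairs found on a page.
rows = [("Book One", "/dp/1"), ("Book Two", "/dp/2")]

from_list = parse_as_list(rows)
from_generator = list(parse_as_generator(rows))

print(from_list == from_generator)
```

A Scrapy callback may return any iterable of items, so either style works; the generator form is the more common idiom.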