2013-01-14 97 views
1

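Scrapy: one row per item

I am using Scrapy to scrape the product catalog pages of architonic * com. However, I would like to get these products into a CSV with one row per product. At the moment, all brands from a given category page end up together in the "brand" column, whereas I would like output like this: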

{'brand': [u'Elisabeth Ellefsen'], 
    'title': [u'Up chair I 907'], 
    'img_url': [u'http://image.architonic.com/img_pro1-1/117/4373/t-up-06f-sq.jpg'], 
    'link': [u'http://www.architonic.com/pmsht/up-chair-tonon/1174373'] 
    } 

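I tried playing with item loaders (adding default_output_processor = TakeFirst()), adding 'yield item' (see the commented-out code), and searched for two days for a solution, but no luck. I hope someone is willing to help me. Any help is really appreciated.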

The output currently looks like this:

2013-01-14 11:53:23+0100 [archi] DEBUG: Scraped from <200 http://www.architonic.com/pmpro/home-furnishings/3210002/2/2/3> 
{'brand': [u'Softline', 
      u'Elisabeth Ellefsen', 
      u'Sellex', 
      u'Lievore Altherr Molina', 
      u'Poliform', 
      ..... 
      u'Hans Thyge & Co.'], 
'img_url': [u'http://image.architonic.com/img_pro1-1/117/3661/terra-h-sq.jpg', 
      u'http://image.architonic.com/img_pro1-1/117/0852/fly-01-sq.jpg', 
      u'http://image.architonic.com/img_pro1-1/116/9870/ley-0004-sq.jpg', 
      u'http://image.architonic.com/img_pro1-1/117/1023/arflex-hollywood-03-sq.jpg', 
      ... 
      u'http://image.architonic.com/img_pro1-1/118/5357/reef-002-sq.jpg'], 
'link': [u'http://www.architonic.com/pmsht/terra-softline/1173661', 
      u'http://www.architonic.com/pmsht/fly-sellex/1170852', 
      u'http://www.architonic.com/pmsht/ley-poliform/1169870', 
      ..... 
      u'http://www.architonic.com/pmsht/reef-collection-labofa/1185357'], 
'title': [u'Terra', 
      u'Fly', 
      u'Ley chair', 
       ..... 
      u'Hollywood Sofa', 
      u'Pouff Round']} 

This is the spider I am using (archi_spider.py):

import string 
import re 

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector     
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.utils.markup import remove_entities 
from archiscraper.items import ArchiItemFields, ArchiLoader 

class ArchiScraper(BaseSpider): 
    name = "archi" 
    allowed_domains = ["architonic.com"] 
    start_urls = ['http://www.architonic.com/pmpro/home-furnishings/3210002/2/2/%s' % page for page in xrange(1, 4)]  
    # rules = (Rule(SgmlLinkExtractor(allow=('.',),restrict_xpaths=('//*[@id="right_arrow"]',)) 
    #  , callback="parse_items", follow= True), 
    #  ) 
    #    
    def parse(self, response): 
     hxs = HtmlXPathSelector(response) 
     sites = hxs.select('//li[contains(@class, "nav_pro_item")]') 
     items = [] 
     for site in sites: 
      item = ArchiLoader(ArchiItemFields(), site) 
      item.add_xpath('brand',  '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[1][self::text()]') 
      item.add_xpath('designer',  '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[3][self::text()]') 
      item.add_xpath('title',  '//*[contains(@class, "nav_pro_text")]/a/strong/text()')     
      item.add_xpath('img_url', '//li[contains(@class, "nav_pro_item")]/div/a/img/@src[1]')      
      item.add_xpath('link', '//*[contains(@class, "nav_pro_text")]/a/@href') 
      items.append(item.load_item())  
      return items 
      # for item in items: 
       # yield item 

items.py

# Define here the models for your scraped items 
# 
# See documentation in: 
# http://doc.scrapy.org/topics/items.html 
import string 
from scrapy.item import Item, Field 
from scrapy.contrib.loader.processor import MapCompose, Join, TakeFirst 
from scrapy.utils.markup import remove_entities 
from scrapy.contrib.loader import XPathItemLoader 

class ArchiItem(): 
    pass 

class ArchiItemFields(Item): 
    brand = Field() 
    title = Field() 
    designer = Field() 
    img_url = Field() 
    img = Field() 
    link = Field() 
    pass 

class ArchiLoader(XPathItemLoader): 
    # default_input_processor = MapCompose(unicode.strip) 
    # default_output_processor= TakeFirst() 

    brand_out = MapCompose(unicode.strip) 
    # title_out = Join()  
+1

Please attach your items.py file, since this code cannot run without it. :) – Talvalin

+0

Thanks, I have added items.py! :) – Joost

+0

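CSV? But your data structure is JSON. So what is the problem? If you want a list of separate items, you should first select each item container and then extract the required data from it. – Denis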

Answers

0

Just return the items list after the for loop ends, i.e.:

for site in sites:
    item = ArchiLoader(ArchiItemFields(), site)
    item.add_xpath('brand', '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[1][self::text()]')
    item.add_xpath('designer', '//*[contains(@class, "nav_pro_text")]/a/br/following-sibling::node()[3][self::text()]')
    item.add_xpath('title', '//*[contains(@class, "nav_pro_text")]/a/strong/text()')
    item.add_xpath('img_url', '//li[contains(@class, "nav_pro_item")]/div/a/img/@src[1]')
    item.add_xpath('link', '//*[contains(@class, "nav_pro_text")]/a/@href')
    items.append(item.load_item())
return items
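
One more thing to watch for: in Scrapy, an XPath that starts with // is evaluated against the whole document even when it is applied to a sub-selector such as site, so each item can still collect every brand on the page. Starting the expression with .// keeps the search inside the current container, which matches Denis's advice to select the container first. Below is a minimal sketch of that pattern. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors so it runs without Scrapy installed; the SAMPLE markup is a simplified, hypothetical version of the catalog page, and parse_products is an illustrative name, not part of the original spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real catalog page.
SAMPLE = """
<ul>
  <li class="nav_pro_item">
    <div class="nav_pro_text">
      <a href="http://www.architonic.com/pmsht/terra-softline/1173661"><strong>Terra</strong></a>
    </div>
  </li>
  <li class="nav_pro_item">
    <div class="nav_pro_text">
      <a href="http://www.architonic.com/pmsht/fly-sellex/1170852"><strong>Fly</strong></a>
    </div>
  </li>
</ul>
"""


def parse_products(markup):
    """Return one dict per product container found in the markup."""
    root = ET.fromstring(markup)
    items = []
    # Select one container per product first ...
    for site in root.findall(".//li[@class='nav_pro_item']"):
        # ... then use paths starting with "." so only this container is searched.
        anchor = site.find(".//a")
        items.append({
            "title": site.findtext(".//a/strong"),
            "link": anchor.get("href"),
        })
    # Return only after the loop has visited every container.
    return items


products = parse_products(SAMPLE)
```

In the spider itself, the equivalent change would be to prefix each expression passed to add_xpath with a dot (for example './/*[contains(@class, "nav_pro_text")]/a/@href'), assuming the loader is built with site as its selector as in the code above.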

Hope this helps :)
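
Regarding the TakeFirst attempt mentioned in the question: once each item holds a single product, every field is still a one-element list, which is what an output processor such as TakeFirst is meant to flatten. The following is a rough sketch of that idea using only the standard library. take_first is a hypothetical stand-in that mimics the documented behaviour of Scrapy's TakeFirst (return the first value that is neither None nor an empty string), and the sample item reuses the desired output from the question.

```python
import csv
import io


def take_first(values):
    """Stand-in for TakeFirst: first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# One loaded item, shaped like the desired output in the question.
item = {
    "brand": ["Elisabeth Ellefsen"],
    "title": ["Up chair I 907"],
    "link": ["http://www.architonic.com/pmsht/up-chair-tonon/1174373"],
}

# Flatten every one-element list into a scalar so it fits a single CSV cell.
row = {field: take_first(values) for field, values in item.items()}

# Write a header plus one row per product into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["brand", "title", "link"])
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

With Scrapy, the same flattening would normally come from the loader's output processors rather than a hand-written helper. Note that a field-specific processor such as brand_out takes precedence over default_output_processor for that field.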