2017-06-20 53 views
2

Nested JSON items in Scrapy

Here is my basic Scrapy spider:

def parse(self, response):   
    item = CruiseItem()  

    item['Cruise'] = {} 
    item['Cruise']['Cruiseline'] = response.xpath('//title/text()').extract() 
    item['Cruise']['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract() 
    item['Cruise']['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract() 
    item['Cruise']['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract() 

    return item 

This correctly pulls all the elements I want. My example JSON feed comes out as follows:

[ 

{ 
    "Cruise": { 
     "Cruiseline": [ 
      "Ship Name" 
     ], 
     "Itinerary": [ 
      "3 Night Bahamas ", 
      "4 Night Western Caribbean ", 
      "4 Night Bahamas ", 
      "3 Night Bahamas ", 
      "5 Night Western Caribbean ", 
      "5 Night Eastern Caribbean ", 
      "7 Night Western Caribbean ", 
      "7 Night Southern Caribbean ", 
      "6 Night Western Caribbean ", 
      "7 Night Western Caribbean ", 
      "8 Night Eastern Caribbean " 
     ], 
     "Price": [ 
      "$169", 
      "$179", 
      "$289", 
      "$349", 
      "$359", 
      "$389", 
      "$389", 
      "$409", 
      "$424", 
      "$524", 
      "$939" 
     ], 
     "PerNight": [ 
      "$56/night", 
      "$45/night", 
      "$72/night", 
      "$116/night", 
      "$72/night", 
      "$78/night", 
      "$56/night", 
      "$58/night", 
      "$71/night", 
      "$75/night", 
      "$117/night" 
     ] 
    } 
} 
] 

The target JSON output, however, is different:

[ 

{ 
    "Cruise": { 
     "Cruiseline": [ 
      "Ship Name" 
     ], 
     "Itinerary": [ 
      "3 Night Bahamas " 
     ], 
     "Price": [ 
      "$169" 
     ], 
     "PerNight": [ 
      "$56/night" 

     ] 
    }, 
    "Cruise": { 
     "Cruiseline": [ 
      "Ship Name" 
     ], 
     "Itinerary": [ 
      "4 Night Bahamas " 
     ], 
     "Price": [ 
      "$79" 
     ], 
     "PerNight": [ 
      "$86/night" 
     ] 
    } 
} 
] 

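Basically, I want each cruise entry to return only one ship, itinerary, price, and per-night price.

Does this make sense? Happy to discuss.

Edit: I asked this question a few days ago, but decided to clarify and repost it. Thanks!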

Answers

0

Figured it out.

def parse(self, response):
    final_list = []

    # Each XPath returns a list, so the item holds four parallel lists.
    item = WthItem()
    item['ship'] = response.xpath('//*[@id="shipName1"]/text()').extract()
    item['Itinerary'] = response.xpath('//*[@id="brochureName1"]/text()').extract()
    item['Price'] = response.xpath('//*[@id="interiorPrice1"]/text()').extract()
    item['PerNight'] = response.xpath('//*[@id="perNightinteriorPrice1"]/text()').extract()

    final_list.append(item)

    updated_list = []

    for item in final_list:
        # Build one nested entry per row; assumes all four lists have the same length.
        for i in range(len(item['ship'])):
            sub_item = {}
            sub_item['entry'] = {}
            sub_item['entry']['ship'] = [item['ship'][i]]
            sub_item['entry']['Itinerary'] = [item['Itinerary'][i]]
            sub_item['entry']['Price'] = [item['Price'][i]]
            sub_item['entry']['PerNight'] = [item['PerNight'][i]]
            updated_list.append(sub_item)

            print(sub_item)

    # Return only after every item has been processed.
    return updated_list
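The index-based loop above can also be written with zip, which pairs the parallel lists row by row. The following is only a sketch under assumptions: split_rows is a hypothetical helper name, the ship names in the usage lines are made up, and it is plain Python rather than spider code. Inside a spider, parse could yield each dict the helper produces instead of building a list.

```python
def split_rows(ships, itineraries, prices, per_night):
    """Yield one nested record per row from four parallel lists."""
    # zip stops at the shortest list, so a missing value on the page
    # shortens the output instead of raising IndexError.
    for ship, itinerary, price, night in zip(ships, itineraries, prices, per_night):
        yield {
            "entry": {
                "ship": [ship],
                "Itinerary": [itinerary],
                "Price": [price],
                "PerNight": [night],
            }
        }


# Usage with made-up values in the same shape as the scraped lists.
rows = list(split_rows(
    ["Ship A", "Ship B"],
    ["3 Night Bahamas ", "4 Night Western Caribbean "],
    ["$169", "$179"],
    ["$56/night", "$45/night"],
))
```

Note the trade-off: zip silently drops unmatched trailing values, whereas indexing by range(len(...)) fails loudly when the lists differ in length.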
0

Try using this script to reformat the data. The reformatted data will end up in updated_list.

cruise_list = [ 

{ 
    "Cruise": { 
     "Cruiseline": [ 
      "Ship Name" 
     ], 
     "Itinerary": [ 
      "3 Night Bahamas ", 
      "4 Night Western Caribbean ", 
      "4 Night Bahamas ", 
      "3 Night Bahamas ", 
      "5 Night Western Caribbean ", 
      "5 Night Eastern Caribbean ", 
      "7 Night Western Caribbean ", 
      "7 Night Southern Caribbean ", 
      "6 Night Western Caribbean ", 
      "7 Night Western Caribbean ", 
      "8 Night Eastern Caribbean " 
     ], 
     "Price": [ 
      "$169", 
      "$179", 
      "$289", 
      "$349", 
      "$359", 
      "$389", 
      "$389", 
      "$409", 
      "$424", 
      "$524", 
      "$939" 
     ], 
     "PerNight": [ 
      "$56/night", 
      "$45/night", 
      "$72/night", 
      "$116/night", 
      "$72/night", 
      "$78/night", 
      "$56/night", 
      "$58/night", 
      "$71/night", 
      "$75/night", 
      "$117/night" 
     ] 
    } 
} 
] 

updated_list = [] 

for cruise_obj in cruise_list:
    cruise_data = cruise_obj['Cruise']
    for i in range(len(cruise_data['Itinerary'])):
        sub_item = {}
        sub_item['Cruise'] = {}
        sub_item['Cruise']['Cruiseline'] = cruise_data['Cruiseline']
        sub_item['Cruise']['Itinerary'] = [cruise_data['Itinerary'][i]]
        sub_item['Cruise']['Price'] = [cruise_data['Price'][i]]
        sub_item['Cruise']['PerNight'] = [cruise_data['PerNight'][i]]
        updated_list.append(sub_item)

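Some other thoughts

  • If the only things stored in your JSON are cruise objects, then the top-level Cruise key is somewhat redundant.

  • In many places you are storing things in arrays that don't need to be arrays. I'm guessing this is a Scrapy issue, but you should try modifying my script to remove the arrays around single values. For example, a cruise object shouldn't have multiple Cruiselines. Let me know if you need help.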
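A rough sketch of what those two suggestions might look like together: it drops the outer Cruise key and stores plain strings instead of one-element lists. flatten_cruises is a hypothetical name, and the code assumes every cruise object has exactly one Cruiseline and that Itinerary, Price, and PerNight line up by position.

```python
def flatten_cruises(cruise_list):
    """Return one flat dict per itinerary, with scalar values instead of lists."""
    flat = []
    for cruise_obj in cruise_list:
        data = cruise_obj["Cruise"]
        # Assumption: exactly one cruise line per object.
        cruiseline = data["Cruiseline"][0]
        rows = zip(data["Itinerary"], data["Price"], data["PerNight"])
        for itinerary, price, per_night in rows:
            flat.append({
                "Cruiseline": cruiseline,
                # The scraped itinerary names carry a trailing space.
                "Itinerary": itinerary.strip(),
                "Price": price,
                "PerNight": per_night,
            })
    return flat


# Usage with a shortened copy of the feed from the question.
sample = [{
    "Cruise": {
        "Cruiseline": ["Ship Name"],
        "Itinerary": ["3 Night Bahamas ", "4 Night Western Caribbean "],
        "Price": ["$169", "$179"],
        "PerNight": ["$56/night", "$45/night"],
    }
}]
flat = flatten_cruises(sample)
```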

+0

Thanks for this. I'm open to trying your ideas, though I may need some help reworking the script – Nathan

+0

This doesn't really help; I don't know where this updated code should be implemented – Nathan

+0

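Well, without seeing your whole codebase, I can't really tell you where the code should go. I'm assuming 'parse' is called multiple times somewhere, since your final data so far is an array. So basically, find which variable your JSON feed is stored in (say it's called cruise_list) and then paste in my code. (My code will only work if your data variable is called 'cruise_list', so if your data variable is called 'x', then do something like 'cruise_list = x' before your data is aggregated, or use 'cruise_list ' – mjkaufer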
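If it is unclear where the feed lives inside the spider, one possible alternative is to reshape the exported JSON after the crawl has finished. The sketch below is only an illustration under assumptions: reshape_feed is a hypothetical name, and the feed is assumed to be JSON text in the shape shown in the question. It uses a string so that it runs on its own; with a real export you would read that text from the feed file instead.

```python
import json


def reshape_feed(feed_text):
    """Split each cruise object in a JSON feed into one object per itinerary."""
    cruise_list = json.loads(feed_text)
    updated_list = []
    for cruise_obj in cruise_list:
        cruise_data = cruise_obj["Cruise"]
        for i in range(len(cruise_data["Itinerary"])):
            updated_list.append({
                "Cruise": {
                    "Cruiseline": cruise_data["Cruiseline"],
                    "Itinerary": [cruise_data["Itinerary"][i]],
                    "Price": [cruise_data["Price"][i]],
                    "PerNight": [cruise_data["PerNight"][i]],
                }
            })
    return json.dumps(updated_list, indent=4)


# Usage with a shortened copy of the feed from the question.
feed_text = json.dumps([{
    "Cruise": {
        "Cruiseline": ["Ship Name"],
        "Itinerary": ["3 Night Bahamas ", "4 Night Bahamas "],
        "Price": ["$169", "$179"],
        "PerNight": ["$56/night", "$45/night"],
    }
}])
reshaped = json.loads(reshape_feed(feed_text))
```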