Scrapy - Creating nested JSON objects

I'm learning how to use Scrapy while refreshing my knowledge of Python and the coding I did at school.

Currently I'm playing around with the IMDb Top 250 list, but I'm struggling with the JSON output file.

My current code is:

# -*- coding: utf-8 -*- 
import scrapy 

from top250imdb.items import Top250ImdbItem 


class ActorsSpider(scrapy.Spider): 
    name = "actors" 
    allowed_domains = ["imdb.com"] 
    start_urls = ['http://www.imdb.com/chart/top'] 

    # Parse each movie and prepare the URL of its full cast list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Find all actors and store them on the item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
            return updated_list

And my output file gives me this JSON structure:

[ 
    { 
    "movie": { 
     "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
     "title": ["The Shawshank Redemption"], 
     "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
     "actors": [["Tim Robbins","Morgan Freeman",...]]} 
    },{ 
    "movie": { 
     "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
     "title": ["The Godfather"], 
     "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
     "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]} 
    } 
]

But this is what I'm trying to achieve:

{ 
    "movies": [{ 
    "poster": "https://images-na.ssl-images-amazon.com/poster...", 
    "title": "The Shawshank Redemption", 
    "actors": [ 
     {"photo": "https://images-na.ssl-images-amazon.com/photo...", 
     "name": "Tim Robbins"}, 
     {"photo": "https://images-na.ssl-images-amazon.com/photo...", 
     "name": "Morgan Freeman"},... 
    ] 
    },{ 
    "poster": "https://images-na.ssl-images-amazon.com/poster...", 
    "title": "The Godfather", 
    "actors": [ 
     {"photo": "https://images-na.ssl-images-amazon.com/photo...", 
     "name": "Marlon Brando"}, 
     {"photo": "https://images-na.ssl-images-amazon.com/photo...", 
     "name": "Al Pacino"},... 
    ] 
    }] 
}

In my items.py file I have the following:

import scrapy 


class Top250ImdbItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 

    # Items from actors.py 
    poster = scrapy.Field() 
    title = scrapy.Field() 
    photo = scrapy.Field() 
    actors = scrapy.Field() 
    movie = scrapy.Field() 
    pass

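I'm aware of the following:

My results don't come out in order. The first movie in the web page list is always the first movie in my output file, but the rest are not in order. I'm still working on that.
I could do the same thing using Top250ImdbItem(); I'm still looking into how to do that properly.
This may not be the perfect layout for my JSON. Suggestions are welcome, and if it is fine, please tell me, even though I know there is no perfect way or "only way".
Some actors don't have a photo, and the placeholder is actually loaded through a different CSS selector. For now I'd like to avoid fetching the "no picture" thumbnail, so those entries can be left empty.

For example: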

{"photo": "", "name": "Al Pacino"}
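
A minimal sketch of how entries like the one above could be built, assuming the name and photo are read one cast row at a time rather than as two separate page-wide lists. The helper names `build_actor` and `build_movie`, the `cast_rows` parameter, and the sample URLs are hypothetical and not part of the original spider. In Scrapy, `extract_first()` returns `None` when a selector matches nothing, so a row without a photo would naturally produce the `None` handled here; the row selector itself is left out because the page markup is not shown.

```python
import json


def build_actor(name, photo=None):
    """Return one actor entry; a missing photo becomes an empty string."""
    return {"photo": photo or "", "name": name.strip()}


def build_movie(poster, title, cast_rows):
    """Build one movie entry.

    cast_rows: iterable of (name, photo_or_None) pairs, one pair per cast row.
    """
    return {
        "poster": poster or "",
        "title": title,
        "actors": [build_actor(name, photo) for name, photo in cast_rows],
    }


# Made-up sample data for illustration only
movie = build_movie(
    "https://example.com/poster.jpg",
    "The Godfather",
    [("Marlon Brando", "https://example.com/brando.jpg"), ("Al Pacino", None)],
)
print(json.dumps(movie, indent=2))
```

Pairing per row keeps each name attached to its own photo even when some rows have none, which two independently extracted lists cannot guarantee.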


2017-07-18 ricardoNava

Don't use `(scrapy.Item)`; use a `dict` that starts with `'movies': []`. – stovfl

Hey @stovfl, could you elaborate a bit? – ricardoNava

Question: ... struggling with a JSON output file

Note: I can't run your ActorsSpider; I get the error: Pseudo-elements are not supported.

# Define a `dict` **once**, at module level
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    # Selectors are elided here, as in the original answer
    poster = response.css(...)
    title = response.css(...)
    photos = response.css(...)
    actors = response.css(...)

    # Assuming the list of actors is in sync with the list of photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                 }

    # Append one movie to the Top250 'movies' list
    top250ImdbItem['movies'].append(one_movie)
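
The snippet above collects the movies in a dict but never writes it out. One possible way to get a single `{"movies": [...]}` object into a file is an item pipeline, sketched below under several assumptions: `parse_actor` yields each movie dict instead of appending it to a module-level variable, the class name `MoviesJsonPipeline` and the file name `movies.json` are made up for illustration, and each movie carries a hypothetical `rank` key (for example passed from `parse` through the request's `meta`) so the list can be sorted, since Scrapy responses arrive in no fixed order. Only the standard pipeline hooks `open_spider`, `process_item`, and `close_spider` are used.

```python
import json


class MoviesJsonPipeline:
    """Collect every scraped movie and write one {"movies": [...]} object."""

    def __init__(self, path="movies.json"):
        self.path = path  # hypothetical output file name
        self.movies = []

    def open_spider(self, spider):
        # Start from an empty list for each crawl
        self.movies = []

    def process_item(self, item, spider):
        # dict() accepts both plain dicts and scrapy.Item instances
        self.movies.append(dict(item))
        return item

    def close_spider(self, spider):
        # 'rank' is an assumed key; movies without it sort first
        self.movies.sort(key=lambda movie: movie.get("rank", 0))
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump({"movies": self.movies}, fh, ensure_ascii=False, indent=2)
```

To try it, the class would be registered in settings.py through `ITEM_PIPELINES`, for instance `{"top250imdb.pipelines.MoviesJsonPipeline": 300}`; that module path is a guess based on the project name in the question's import.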


2017-07-19 17:11:26 stovfl

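OK, I'll check it out. It's a bit odd that you can't run it, since I'm still using exactly the same code. I'll look into that as well and update the question so you can see whether it runs for you. I'll try these suggestions. The photos and actors are in fact not in sync, and I'm still figuring out how to handle that, but your help is really great. – ricardoNava

Should I post the modified working code here as a comment, edit the current code, or leave it as it is? – ricardoNava

[Edit] your question and add only the changed parts. – stovfl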
