2017-07-18 191 views

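Scrapy - Creating nested JSON objects

I'm learning how to use Scrapy while brushing up on the Python I learned at school.

Currently I'm playing around with the IMDb Top 250 list, but I'm struggling with the JSON output file.

My current code is: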

# -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
            return updated_list

And my output file gives me this JSON structure:

[
    {
        "movie": {
            "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
            "title": ["The Shawshank Redemption"],
            "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
            "actors": [["Tim Robbins","Morgan Freeman",...]]
        }
    },{
        "movie": {
            "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
            "title": ["The Godfather"],
            "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
            "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]
        }
    }
]

But this is what I'm trying to achieve:

{
    "movies": [{
        "poster": "https://images-na.ssl-images-amazon.com/poster...",
        "title": "The Shawshank Redemption",
        "actors": [
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Tim Robbins"},
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Morgan Freeman"},...
        ]
    },{
        "poster": "https://images-na.ssl-images-amazon.com/poster...",
        "title": "The Godfather",
        "actors": [
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Marlon Brando"},
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Al Pacino"},...
        ]
    }]
}

In my items.py file I have the following:

import scrapy 


class Top250ImdbItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 

    # Items from actors.py 
    poster = scrapy.Field() 
    title = scrapy.Field() 
    photo = scrapy.Field() 
    actors = scrapy.Field() 
    movie = scrapy.Field() 
    pass 

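I'm aware of the following:

  1. My results don't come out in order. The first movie on the web page list is always the first movie in my output file, but the rest are not. I'm still working on that.

  2. I could do the same thing using Top250ImdbItem(); I'm still looking into how to do that in a more thorough way.

  3. This might not be the perfect layout for my JSON. Suggestions are welcome, or if it is fine, please tell me, even though I know there is no perfect way or "only way".

  4. Some actors don't have a photo, and in that case the page actually uses a different CSS selector. For now I'd like to avoid fetching the "no picture" thumbnail, so it's fine to leave those entries empty.

For example: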

{"photo": "", "name": "Al Pacino"} 

Don't use '(scrapy.Item)'; use a 'dict' that starts with 'movies: []'. – stovfl

Hey @stovfl, could you elaborate a bit? – ricardoNava

Answer

Question: ... struggling with a JSON output file


Note: I can't run your ActorsSpider; I get the error "Pseudo-elements are not supported."

# Define a `dict` **once**
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    poster = response.css(...
    title = response.css(...
    photos = response.css(...
    actors = response.css(...

    # Assuming List of Actors are in sync with List of Photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                 }

    # Append One Movie to Top250 'movies' List
    top250ImdbItem['movies'].append(one_movie)
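
The snippet above fills `top250ImdbItem` but does not show how the collected movies end up ordered or written out. Scrapy schedules requests asynchronously, so the credit pages can finish in a different order than the chart. A minimal sketch of one way to handle that, assuming each movie dict carries a hypothetical `rank` key (for instance the chart position passed along with the request) that is removed again before output; `build_output` is an illustrative helper, not part of Scrapy:

```python
import json


def build_output(movies):
    """Sort collected movie dicts by chart position and wrap them in one object.

    Each dict is assumed to carry a hypothetical 'rank' key (1 = top of the
    chart) stored next to poster, title, and actors.
    """
    ordered = sorted(movies, key=lambda movie: movie["rank"])
    # Drop the helper key so the result matches the requested layout.
    cleaned = [{k: v for k, v in movie.items() if k != "rank"} for movie in ordered]
    return {"movies": cleaned}


# Movies listed in the order the responses happened to arrive.
collected = [
    {"rank": 2, "poster": "", "title": "The Godfather", "actors": []},
    {"rank": 1, "poster": "", "title": "The Shawshank Redemption", "actors": []},
]
print(json.dumps(build_output(collected), indent=4))
```

In a spider, a call like this could run once after the crawl has finished, for example from the spider's `closed` method, and the result could be written with `json.dump` instead of printed.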

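OK, I'll check. It's a bit odd that you can't run it, since I'm still using exactly the same code. I'll look into that problem too and update the question so you can see whether it runs for you. I'll try these suggestions. The photos and actors are in fact not in sync, and I'm still figuring out how to handle that, but your help is really great. – ricardoNava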

Should I post the modified working code here as a comment, edit the current code, or leave it as it is? – ricardoNava

[Edit] your question and add only the changed parts. – stovfl