2016-04-04

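Pandas: retrieving nested data from JSON files

I am parsing nested JSON data from here. Some of the files in this dataset have more than one committee_id associated with them, and I need all of the committees associated with each file. I'm not sure, but I think this means writing a new row for each committee_id. My code is below: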

import os
import csv
import json

path = '/home/jayaramdas/anaconda3/Thesis/govtrack/bills109/hr'
dirs = os.listdir(path)
outputfile = open('df/h109_s_b', 'w', newline='')
outputwriter = csv.writer(outputfile)

for dir in dirs:
    # Each bill lives in its own sub-directory as data.json
    with open(path + "/" + dir + "/data.json", "r") as fh:
        data = json.load(fh)

    a = data['introduced_at']
    b = data['bill_id']
    c = data['sponsor']['thomas_id']
    d = data['sponsor']['state']
    e = data['sponsor']['name']
    f = data['sponsor']['type']
    i = data['subjects_top_term']
    j = data['official_title']

    # Only the first committee entry is read here
    if data['committees']:
        g = data['committees'][0]['committee_id']
    else:
        g = "None"
    outputwriter.writerow([a, b, c, d, e, f, g, i, j])
outputfile.close()

The problem I'm having is that my code only collects the first committee_id listed. For example, a file looks like this:

"committees": [ 
{ 
    "activity": [ 
    "referral", 
    "in committee" 
    ], 
    "committee": "House Transportation and Infrastructure", 
    "committee_id": "HSPW" 
}, 
{ 
    "activity": [ 
    "referral" 
    ], 
    "committee": "House Transportation and Infrastructure", 
    "committee_id": "HSPW", 
    "subcommittee": "Subcommittee on Economic Development, Public Buildings and Emergency Management", 
    "subcommittee_id": "13" 
}, 
{ 
    "activity": [ 
    "referral", 
    "in committee" 
    ], 
    "committee": "House Financial Services", 
    "committee_id": "HSBA" 
}, 
{ 
    "activity": [ 
    "referral" 
    ], 
    "committee": "House Financial Services", 
    "committee_id": "HSBA", 
    "subcommittee": "Subcommittee on Domestic and International Monetary Policy, Trade, and Technology", 
    "subcommittee_id": "19" 
} 

This is where it gets a little tricky, because I also want the subcommittee_id associated with each committee_id whenever the bill was passed on to a subcommittee:

bill_id    committee  subcommittee  introduced_at  thomas_id  state  name
hr145-109  HSPW       na            "2005-01-4"    73         NY     "McHugh, John M."
hr145-109  HSPW       13            "2005-01-4"    73         NY     "McHugh, John M."
hr145-109  HSBA       na            "2005-01-4"    73         NY     "McHugh, John M."
hr145-109  HSBA       19            "2005-01-4"    73         NY     "McHugh, John M."

Any suggestions?

Answers

You can do it like this:

In [111]: with open(fn) as f: 
    .....:  data = ujson.load(f) 
    .....: 

In [112]: committees = pd.io.json.json_normalize(data, 'committees') 

In [113]: committees 
Out[113]: 
             activity                                committee  committee_id                            subcommittee  subcommittee_id
0          [referral]                House Energy and Commerce          HSIF                                     NaN              NaN
1          [referral]                House Energy and Commerce          HSIF  Subcommittee on Energy and Air Quality               03
2          [referral]        House Education and the Workforce          HSED                                     NaN              NaN
3          [referral]                 House Financial Services          HSBA                                     NaN              NaN
4          [referral]                        House Agriculture          HSAG                                     NaN              NaN
5  [referral, markup]                          House Resources          HSII                                     NaN              NaN
6          [referral]                            House Science          HSSY                                     NaN              NaN
7          [referral]                     House Ways and Means          HSWM                                     NaN              NaN
8          [referral]  House Transportation and Infrastructure          HSPW                                     NaN              NaN
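
To carry the bill-level fields onto each committee row as well, json_normalize accepts a meta argument. The following is a minimal sketch rather than a definitive solution. It assumes pandas 1.0 or later, where the function is also exposed as pd.json_normalize, and it uses a hand-written bill dict that imitates the structure shown in the question, with values copied from the question's example rather than loaded from a real file:

```python
import pandas as pd

# Hypothetical stand-in for json.load(...) on one data.json file;
# values are taken from the example in the question.
bill = {
    "bill_id": "hr145-109",
    "introduced_at": "2005-01-4",
    "sponsor": {"thomas_id": "73", "state": "NY", "name": "McHugh, John M."},
    "committees": [
        {"committee": "House Transportation and Infrastructure",
         "committee_id": "HSPW"},
        {"committee": "House Transportation and Infrastructure",
         "committee_id": "HSPW", "subcommittee_id": "13"},
        {"committee": "House Financial Services",
         "committee_id": "HSBA"},
        {"committee": "House Financial Services",
         "committee_id": "HSBA", "subcommittee_id": "19"},
    ],
}

# One row per entry in "committees"; the meta fields are repeated on each row.
# Nested meta paths become dotted column names such as "sponsor.state".
rows = pd.json_normalize(
    bill,
    record_path="committees",
    meta=["bill_id", "introduced_at",
          ["sponsor", "thomas_id"], ["sponsor", "state"], ["sponsor", "name"]],
)

# Fix the column order; reindex also creates subcommittee_id (as NaN)
# if no committee entry in the file happens to have one.
rows = rows.reindex(columns=[
    "bill_id", "committee_id", "subcommittee_id", "introduced_at",
    "sponsor.thomas_id", "sponsor.state", "sponsor.name",
])
print(rows)
```

Committee entries without a subcommittee get NaN in subcommittee_id. Note that a bill whose committees list is empty produces no rows at all with this approach, so such bills would need separate handling.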

UPDATE: if you want to have all the data in one DataFrame, you can do it like this:

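import os 
import ujson 
import pandas as pd 

start_path = '/home/jayaramdas/anaconda3/Thesis/govtrack/bills109/hr' 

def get_merged_json(start_path): 
    # Walk the directory tree and load every .json file into one list of dicts
    return [ujson.load(open(os.path.join(p, f))) 
      for p, _, files in os.walk(start_path) 
      for f in files 
      if f.endswith('.json') 
      ] 

# Collect all bills, then build a single DataFrame from them
data = get_merged_json(start_path) 
df = pd.read_json(ujson.dumps(data)) 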

PS: it will put all the committees into a single column as JSON data, though.
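
If the goal is the CSV layout from the question rather than a DataFrame, the same expansion can be done with the standard library alone. This is a rough sketch under stated assumptions: committee_rows is a hypothetical helper, sample_bills stands in for the dicts returned by json.load on each data.json, and "hr000-109" is a made-up placeholder bill. An in-memory buffer replaces the real output file so the example runs anywhere:

```python
import csv
import io


def committee_rows(bill):
    """Yield one row per committee entry of an already-parsed bill dict."""
    sponsor = bill["sponsor"]
    # A bill with no committees still gets one row, with "None" as the
    # committee_id, mirroring the fallback in the question's code.
    for entry in bill.get("committees") or [{}]:
        yield [
            bill["bill_id"],
            entry.get("committee_id", "None"),
            entry.get("subcommittee_id", "na"),  # "na" when no subcommittee
            bill["introduced_at"],
            sponsor["thomas_id"],
            sponsor["state"],
            sponsor["name"],
        ]


# Hypothetical sample data; the second bill is a made-up placeholder
# that exercises the empty-committees case.
sample_bills = [
    {
        "bill_id": "hr145-109",
        "introduced_at": "2005-01-4",
        "sponsor": {"thomas_id": "73", "state": "NY", "name": "McHugh, John M."},
        "committees": [
            {"committee_id": "HSPW"},
            {"committee_id": "HSPW", "subcommittee_id": "13"},
        ],
    },
    {
        "bill_id": "hr000-109",
        "introduced_at": "2005-01-4",
        "sponsor": {"thomas_id": "73", "state": "NY", "name": "McHugh, John M."},
        "committees": [],
    },
]

buffer = io.StringIO()  # stands in for open('df/h109_s_b', 'w', newline='')
writer = csv.writer(buffer)
for bill in sample_bills:
    writer.writerows(committee_rows(bill))
print(buffer.getvalue())
```

In the original loop over directories, the single writerow call would be replaced by writer.writerows(committee_rows(data)) for each loaded file.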

Thanks again, MaxU! I have a quick question: what should fn refer to? Wait, I think I've got it: 'fn' = 'filename'.

@MichaelPerdue, yes, it should be the full or relative path to your file, including its name. – MaxU

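I've applied your code with one exception: I replaced ujson with json, because I was getting "NameError: name 'ujson' is not defined". However, it only returns one row. Since I'm using '(path + "/" + dir + "/data.json", "r")' I can probably work around it, but would you know why that is?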