2016-03-13 58 views
3

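Retrieving data from a YAML file based on a Python list

I am working in IPython. I have a YAML file and a list of [thomas] ids that correspond to entries in it (`thomas:` is on the third line of the file). Below is just a small part of the file. The full file can be found here: https://github.com/108michael/congress-legislators/blob/master/legislators-historical.yaml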

- id:
    bioguide: C000858
    thomas: '00246'
    lis: S215
    govtrack: 300029
    opensecrets: N00002091
    votesmart: 53288
    icpsr: 14809
    fec:
    - S0ID00057
    wikipedia: Larry Craig
    house_history: 11530
  name:
    first: Larry
    middle: E.
    last: Craig
  bio:
    birthday: '1945-07-20'
    gender: M
    religion: Methodist
  terms:
  - type: rep
    start: '1981-01-05'
    end: '1983-01-03'
    state: ID
    district: 1
    party: Republican
  - type: rep
    start: '1983-01-03'
    end: '1985-01-03'
    state: ID
    district: 1
    party: Republican

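I want to parse the file and, for each id in my list that matches an id under [thomas:], retrieve the following: [fec:] (there may be more than one, and I need all of them); [name:] [first:] [middle:] [last:]; [bio:] [birthday:]; [terms:] (there may be more than one term, and I need all of them) [type:] [start:] [state:] [party:]. Finally, there may also be cases where fec data is not available.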

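1) How should I store the data? I am fairly new to Python (my first programming language) and I am not sure how to store the data. Intuitively I would say a dictionary; what matters most, however, is ease of access and data retrieval. Previously I stored similar nested data as CSV, which felt a bit clunky. It seems ideal if I could build a dictionary (of the data I retrieve) from a list (of the thomas ids I have).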

2) I don't know how to set up the for/while statement so that I only retrieve the data that corresponds to my list of thomas ids.

I started writing what I expected would be the code that writes the information to CSV:

import pandas as pd
import yaml
import glob
import csv
df = pd.concat((pd.read_csv(f, names=['date','bill_id','sponsor_id']) for f in glob.glob('/home/jayaramdas/anaconda3/df/s11?_s_b')))

outputfile = open('sponsor_details', 'w', newline='')
outputwriter = csv.writer(outputfile)

df = df.drop_duplicates('sponsor_id')
sponsor_list = df['sponsor_id'].tolist()

with open('legislators-historical.yaml', 'r') as f:
    data = yaml.load(f)

    for sponsor in sponsor_list:
        if sponsor == data[0]['thomas']:
            x = data[0]['thomas']
            a = data[0]['name']['first']
            b = data[0]['name']['middle']
            c = data[0]['name']['last']
            d = data[0]['bio']['gender']
            e = data[0]['bio']['religion']

            for fec in data[0]['id']:
                c = fec.get('fec')

                for terms in data[0]['id']:
                    t = terms.get('type')
                    s = terms.get('start')
                    state = terms.get('state')
                    p = terms.get('party')

    outputwriter.writerow([x, a, b, c, d, e, c, t, s, state, p])
    outputfile.flush()

I get the following error:

--------------------------------------------------------------------------- 
KeyError         Traceback (most recent call last) 
<ipython-input-48-057d25de7e11> in <module>() 
    15 
    16  for sponsor in sponsor_list: 
---> 17   if sponsor == data[0]['thomas']: 
    18    x = data[0]['thomas'] 
    19    a = data[0]['name']['first'] 

KeyError: 'thomas' 
+0

Maybe it helps to change `for f in sponsor_list:` to `for sponsor in sponsor_list:` – jezrael

+0

I just tried your suggestion. I still get the following error: `File "", line 17 where sponsor == data[0][thomas]: ^ SyntaxError: invalid syntax` –

+0

Yes, that doesn't look right either. But I have never worked with `yaml`. Maybe one approach is to convert the `yaml` to `json` and then use `pd.read_json` to create a `DataFrame`. – jezrael

Answers

4

I think you could try normalizing the data while parsing the YAML and loading it into a DataFrame:

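import pandas as pd
import yaml

with open('legislators-historical.yaml', 'r') as f:
    # safe_load parses plain data and needs no explicit Loader argument
    # (on pandas >= 1.0 the same function is available as pd.json_normalize)
    df = pd.io.json.json_normalize(yaml.safe_load(f))

print(df.head())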

Output:

bio.birthday bio.gender bio.religion id.bioguide  id.fec id.govtrack \ 
0 1943-12-02   M Protestant  A000109 [S6CO00168]  300003 
1 1745-04-02   M   NaN  B000226   NaN  401222 
2 1742-03-21   M   NaN  B000546   NaN  401521 
3 1743-06-16   M   NaN  B001086   NaN  402032 
4 1730-07-22   M   NaN  C000187   NaN  402334 

    id.house_history id.icpsr id.lis id.opensecrets id.thomas id.votesmart \ 
0    8410  29108 S250  N00009082  00011   26783 
1    NaN  507 NaN   NaN  NaN   NaN 
2    9479  786 NaN   NaN  NaN   NaN 
3    10177  1260 NaN   NaN  NaN   NaN 
4    10687  1538 NaN   NaN  NaN   NaN 

    id.wikipedia name.first name.last name.middle \ 
0 Wayne Allard  Wayne Allard   A. 
1    NaN  Richard Bassett   NaN 
2    NaN Theodorick  Bland   NaN 
3 Aedanus Burke  Aedanus  Burke   NaN 
4 Daniel Carroll  Daniel Carroll   NaN 

               terms 
0 [{'party': 'Republican', 'type': 'rep', 'state... 
1 [{'party': 'Anti-Administration', 'type': 'sen... 
2 [{'end': '1791-03-03', 'district': 9, 'type': ... 
3 [{'end': '1791-03-03', 'district': 2, 'type': ... 
4 [{'end': '1791-03-03', 'district': 6, 'type': ... 
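
The `id.thomas` column in the output above shows that `thomas` is nested under `id`, which would explain the `KeyError: 'thomas'` raised by `data[0]['thomas']` in the question. As a rough sketch of a plain-dictionary approach (not the method used in this answer), the parsed records could be indexed by thomas id and then looked up from the sponsor list. The sample data, the helper names `index_by_thomas` and `summarize`, and the zero-padding of ids to five digits are assumptions made for illustration; the padding only matters if the ids were read from CSV as integers.

```python
# Stand-in for yaml.safe_load(f): a list of dicts shaped like the file's records.
# The second record is hypothetical and only shows the missing-key cases.
records = [
    {
        "id": {"bioguide": "C000858", "thomas": "00246", "fec": ["S0ID00057"]},
        "name": {"first": "Larry", "middle": "E.", "last": "Craig"},
        "bio": {"birthday": "1945-07-20", "gender": "M"},
        "terms": [
            {"type": "rep", "start": "1981-01-05", "state": "ID", "party": "Republican"},
            {"type": "rep", "start": "1983-01-03", "state": "ID", "party": "Republican"},
        ],
    },
    {
        "id": {"bioguide": "X000000"},  # hypothetical: no thomas id, no fec
        "name": {"first": "Example", "last": "Person"},
        "bio": {},
        "terms": [],
    },
]


def index_by_thomas(records):
    """Return {thomas_id: record}, skipping records that have no thomas id."""
    index = {}
    for rec in records:
        thomas = rec.get("id", {}).get("thomas")
        if thomas is not None:
            index[thomas] = rec
    return index


def summarize(rec):
    """Pull out the fields asked for in the question; fec may be absent."""
    return {
        "fec": rec["id"].get("fec", []),
        "name": rec.get("name", {}),
        "birthday": rec.get("bio", {}).get("birthday"),
        "terms": [
            {key: term.get(key) for key in ("type", "start", "state", "party")}
            for term in rec.get("terms", [])
        ],
    }


sponsor_list = [246, "99999"]  # hypothetical ids; 246 mimics an int read from CSV
index = index_by_thomas(records)

details = {}
for sponsor in sponsor_list:
    key = str(sponsor).zfill(5)  # assumption: thomas ids are 5-digit strings
    if key in index:
        details[key] = summarize(index[key])

print(details["00246"]["name"]["last"])  # Craig
print(len(details["00246"]["terms"]))    # 2
```

A dictionary keyed by thomas id keeps lookups direct (`details['00246']['terms']`), while ids that have no match in the file are simply skipped.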

UPDATE

The following version filters your input data so that only records containing both "thomas" and "fec" are processed:

#import ujson
#import pprint as pp
import yaml
import pandas as pd
from pandas.io.json import json_normalize

def read_yaml(fn):
    # safe_load parses plain data and needs no explicit Loader argument
    with open(fn, 'r') as fi:
        return yaml.safe_load(fi)

def filter_data(data):
    # keep only records that have both id.fec and id.thomas
    result_data = []
    for x in data:
        if 'id' not in x: continue
        if 'fec' not in x['id']: continue
        if 'thomas' not in x['id']: continue
        result_data.append(x)
    return result_data


fn = 'aaa.yaml'

# one row per entry in 'terms', with id.fec and id.thomas repeated on each row
df = json_normalize(filter_data(read_yaml(fn)), 'terms', [['id', 'fec'], ['id', 'thomas']])
print(df.head())

df.to_csv('out.csv')

Output:

class district   end  party  start state type \ 
0 NaN   4 1993-01-03 Republican 1991-01-03 CO rep 
1 NaN   4 1995-01-03 Republican 1993-01-05 CO rep 
2 NaN   4 1997-01-03 Republican 1995-01-04 CO rep 
3  2  NaN 2003-01-03 Republican 1997-01-07 CO sen 
4  2  NaN 2009-01-03 Republican 2003-01-07 CO sen 

         url id.thomas  id.fec 
0      NaN  00011 S6CO00168 
1      NaN  00011 S6CO00168 
2      NaN  00011 S6CO00168 
3      NaN  00011 S6CO00168 
4 http://allard.senate.gov  00011 S6CO00168 

PS: As you can see, this duplicates your rows (see `id.thomas` and `id.fec`) so that the data can be represented as a DataFrame.
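
The row duplication mentioned in the PS can also be reproduced with the standard `csv` module, which is closer to what the question's code was attempting. This is only a sketch under assumptions: the `records` sample stands in for the parsed YAML, `rows_per_term` is a hypothetical helper, and joining several fec ids with `;` is an arbitrary choice.

```python
import csv
import io

# Stand-in for the parsed YAML (one record, abbreviated).
records = [
    {
        "id": {"thomas": "00246", "fec": ["S0ID00057"]},
        "name": {"first": "Larry", "middle": "E.", "last": "Craig"},
        "terms": [
            {"type": "rep", "start": "1981-01-05", "state": "ID", "party": "Republican"},
            {"type": "rep", "start": "1983-01-03", "state": "ID", "party": "Republican"},
        ],
    }
]


def rows_per_term(rec):
    """Yield one flat row per term, repeating the person-level fields."""
    ids = rec["id"]
    name = rec.get("name", {})
    fec = ";".join(ids.get("fec", []))  # empty string when fec is unavailable
    for term in rec.get("terms", []):
        yield [
            ids.get("thomas", ""),
            name.get("first", ""),
            name.get("middle", ""),
            name.get("last", ""),
            fec,
            term.get("type", ""),
            term.get("start", ""),
            term.get("state", ""),
            term.get("party", ""),
        ]


# An in-memory buffer keeps the sketch self-contained; a real file opened with
# open(..., 'w', newline='') could be passed to csv.writer instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["thomas", "first", "middle", "last", "fec", "type", "start", "state", "party"])
for rec in records:
    writer.writerows(rows_per_term(rec))

print(buffer.getvalue())
```

Each term becomes its own row, so the thomas id, name, and fec values repeat once per term, just as in the DataFrame output above.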

UPDATE2

You may also want to convert the `id.fec` lists into columns, but I would do that in a separate DataFrame:

df_fec = df['id.fec'].apply(pd.Series) 

print(df_fec.head()) 

Output:

  0   1 
0 S8AR00112 H2AR01022 
1 S8AR00112 H2AR01022 
2 S8AR00112 H2AR01022 
3 S8AR00112 H2AR01022 
4 S6CO00168  NaN 
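
What `apply(pd.Series)` does to the list column can be pictured as padding every list to the length of the longest one. A small standard-library sketch of that idea, using fec values from the output above; `lists_to_columns` is a hypothetical helper, and `None` plays the role of `NaN`:

```python
def lists_to_columns(lists):
    """Pad each list with None so that all rows have the same number of columns."""
    width = max((len(item) for item in lists), default=0)
    return [list(item) + [None] * (width - len(item)) for item in lists]


fec_lists = [["S8AR00112", "H2AR01022"], ["S6CO00168"]]
for row in lists_to_columns(fec_lists):
    print(row)
# ['S8AR00112', 'H2AR01022']
# ['S6CO00168', None]
```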
+0

Thanks, Max. I tried your code (I am using IPython) and got the following error: `ImportError: No module named 'ujson'` –

+0

You can install it with `pip install ujson` or use `json` instead – MaxU

+0

I installed ujson and got this error: `----> 8 df = pd.io.json.json_normalize(ujson.dumps(yaml.load(f)))` `AttributeError: 'str' object has no attribute 'values'` –
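
The `AttributeError` in the last comment is consistent with `json_normalize` receiving a string: `ujson.dumps` serializes the parsed YAML back to JSON text, whereas `json_normalize` expects a dict or a list of dicts. A short sketch with the standard `json` module (used here in place of `ujson`, with a made-up one-record sample) shows the difference in types:

```python
import json

# Made-up sample standing in for the parsed YAML.
data = [{"id": {"thomas": "00246"}, "name": {"last": "Craig"}}]

text = json.dumps(data)      # serializing produces a str, not a list of dicts
print(type(text).__name__)   # str

restored = json.loads(text)  # parsing the text gives the nested structure back
print(restored == data)      # True
```

Under that reading, the parsed YAML can be passed to `json_normalize` directly, as in the answer's code, without a `dumps` step.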