2014-09-26

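"Group By" on multiple columns for large data in HDFStore

I have been trying the example from the answer to 'Pandas "Group By" Query on Large Data in HDFStore?', with one difference: I want to be able to group by two columns.

Basically, the modified code looks like this: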

# pd.HDFStore replaces pd.get_store, which was removed in pandas 1.0
with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])
    print("store:\n%s" % store)

    print("\ndf:\n%s" % store['df'])

    # get the groups (this is the call that raises the KeyError)
    groups = store.select_column('df', ['A', 'B']).unique()
    print("\ngroups:%s" % groups)

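I have tried several ways of selecting columns A and B, and cannot get it to work.

The error thrown is KeyError: "column ['A', 'B'] not found in the table"

Is this supported?

Thanks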

Answer

store.select_column(...) only selects a SINGLE column.

Slightly modifying the original code from the linked answer:

import os

import numpy as np
import pandas as pd

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': [1, 1, 1, 2,
                         1, 1, 1, 2,
                         2, 2, 1],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})


# create the store and append, using data_columns where I possibly
# could aggregate
# (pd.HDFStore replaces pd.get_store, which was removed in pandas 1.0)
with pd.HDFStore(fname, mode='w') as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])

    print("\ndf:\n%s" % store['df'])

    # get the groups: select each column on its own, then combine them
    A = store.select_column('df', 'A')
    B = store.select_column('df', 'B')
    idx = pd.MultiIndex.from_arrays([A, B])
    groups = idx.unique()

    # iterate over the groups and apply my operations
    results = []
    for (a, b) in groups:

        # %r quotes the string value of A; B is an integer
        grp = store.select('df', where='A=%r and B=%s' % (a, b))

        # this is a regular frame, aggregate however you would like
        results.append(grp[['D', 'E', 'F']].sum())

print("\nresult:\n%s" % pd.concat(results, keys=list(groups)))

os.remove(fname)

Here are the results.

The starting frame (it differs from the original example in that column B now holds integers, just for clarity):

df:
      A  B      C         D         E         F
0   foo  1   dull  0.993672 -0.889936  0.300826
1   foo  1   dull -0.708760 -1.121964 -1.339494
2   foo  1  shiny -0.606585 -0.345783  0.734747
3   foo  2   dull -0.818121 -0.187682 -0.258820
4   bar  1   dull -0.612097 -0.588711  1.417523
5   bar  1  shiny -0.591513  0.661931  0.337610
6   bar  1  shiny -0.974495  0.347694 -1.100550
7   bar  2   dull  1.888711  1.824509 -0.635721
8   foo  2  shiny  0.715446 -0.540150  0.789633
9   foo  2  shiny -0.262954  0.957464 -0.042694
10  foo  1  shiny  0.193822 -0.241079 -0.478291

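The unique groups. We select each column that needs to be grouped on independently, then take the resulting columns and construct a MultiIndex from them. These are the unique groups of the resulting MultiIndex.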

groups:[('foo', 1) ('foo', 2) ('bar', 1) ('bar', 2)] 
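
The group-discovery step can be tried without an HDF5 file. The following is a minimal sketch, assuming two small in-memory Series (hypothetical stand-ins for what store.select_column('df', 'A') and store.select_column('df', 'B') would return); it only illustrates how pd.MultiIndex.from_arrays and unique() combine two columns into distinct (A, B) pairs.

```python
import pandas as pd

# Hypothetical stand-ins for the two select_column() results;
# in the real code these come from the HDFStore.
A = pd.Series(['foo', 'foo', 'bar', 'bar', 'foo'], name='A')
B = pd.Series([1, 2, 1, 1, 1], name='B')

# Pair the columns row by row, then drop repeated pairs.
idx = pd.MultiIndex.from_arrays([A, B])
groups = idx.unique()

# Iterating a MultiIndex yields one tuple per (A, B) combination.
group_list = list(groups)
print(group_list)
```

Only the two grouping columns are read into memory here, which is what keeps this approach workable when the full table is large.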

The final result.

result:
foo  1  D   -0.127852
        E   -2.598762
        F   -0.782213
     2  D   -0.365629
        E    0.229632
        F    0.488119
bar  1  D   -2.178105
        E    0.420914
        F    0.654583
     2  D    1.888711
        E    1.824509
        F   -0.635721
dtype: float64
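
The three-level index in the result comes from passing tuple keys to pd.concat. The following is a small sketch of that step in isolation, assuming made-up per-group sums (the values and the names keys and parts are hypothetical, standing in for the grp[['D', 'E', 'F']].sum() results collected in the loop).

```python
import pandas as pd

# Hypothetical per-group aggregates; in the real code each one is
# grp[['D', 'E', 'F']].sum() for a single (A, B) group.
keys = [('foo', 1), ('bar', 2)]
parts = [
    pd.Series({'D': 1.0, 'E': 2.0, 'F': 3.0}),
    pd.Series({'D': 4.0, 'E': 5.0, 'F': 6.0}),
]

# Tuple keys become the outer index levels, so every value ends up
# addressed by (A, B, column name).
result = pd.concat(parts, keys=keys)
print(result)

# Look up one aggregate by its full key.
value = result.loc[('bar', 2, 'E')]
print(value)
```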