2012-04-12 42 views

I have the following pandas DataFrame, built from an array with non-unique data, and I want to put a MultiIndex on it:

In[45]: data[:10] 
Out[45]: 
     Z    A    beta2      M   shell
0  100  200   0.3112  197.2  -4.213
1  100  200  -0.4197    202  -1.143
2  100  200  0.03205    203       0
3  100  201   0.2967    191  -4.434
4  100  201  -0.4893  196.1  -4.691
5  100  202   0.3084  183.4  -4.134
6  100  202  -0.4873  188.2   -4.75
7  100  202  -0.2483  188.4  -1.106
8  100  203   0.3069  177.1  -4.355
9  101  203  -0.4956  182.5  -5.217

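My question is: how can I group/transform the data so that (Z, A) becomes the index (a MultiIndex), given that the (Z, A) pairs are not unique? To make my goal clear, this is what I hope to get: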

     Z    A  beta2[1]  beta2[2]  beta2[3]   M[1]   M[2]   M[3]  shell[1]  shell[2]  shell[3]
0  100  200    0.3112   -0.4197   0.03205  197.2    202    203    -4.213    -1.143         0
1  100  201    0.2967   -0.4893       NaN    191  196.1    NaN    -4.434    -4.691       NaN
2  100  202    0.3084   -0.4873   -0.2483  183.4  188.2  188.4    -4.134     -4.75    -1.106
3  100  203    0.3069       NaN       NaN  177.1    NaN    NaN    -4.355       NaN       NaN
4  101  203   -0.4956       NaN       NaN  182.5    NaN    NaN    -5.217       NaN       NaN

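I understand that this involves at least two steps, one to handle the non-uniqueness and one to index on Z and A, so any help is appreciated. Is there perhaps a data structure better suited to this problem?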

Edit: I have found that the line:

data = data.set_index(['Z', 'A'])

solves the problem of indexing on Z and A. Unfortunately, this only works if the (Z, A) pairs are unique.

Answer


I have an open issue about supporting this kind of problem:

https://github.com/pydata/pandas/issues/388

Here is one solution. First, a simple (and not very efficient) function that returns each row's ordinal position within its group:

import numpy as np
from collections import defaultdict

def group_position(*args):
    """
    Get each row's 0-based position within its group of equal keys.
    """
    table = defaultdict(int)  # key tuple -> number of times seen so far

    result = []
    for tup in zip(*args):
        result.append(table[tup])
        table[tup] += 1

    return np.array(result)

In [49]: group_position(df['Z'], df['A']) 
Out[49]: array([0, 1, 2, 0, 1, 0, 1, 2, 0, 0]) 
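
Newer pandas versions appear to cover this numbering step with a built-in, `GroupBy.cumcount()`. The following is a minimal sketch, assuming a pandas version that provides `cumcount` and using only the `Z` and `A` columns of the sample data from the question:

```python
import pandas as pd

# The (Z, A) key columns from the question's sample data.
df = pd.DataFrame({
    "Z": [100] * 9 + [101],
    "A": [200, 200, 200, 201, 201, 202, 202, 202, 203, 203],
})

# cumcount() numbers the rows of each (Z, A) group 0, 1, 2, ...
# in order of appearance, like group_position above.
pos = df.groupby(["Z", "A"]).cumcount()
print(pos.tolist())
```

The result is a Series aligned with `df`, so it can be assigned straight to a new column.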

Now use this as an auxiliary index variable and unstack:

In [52]: df 
Out[52]: 
     Z    A     beta2      M   shell
0  100  200   0.31120  197.2  -4.213
1  100  200  -0.41970  202.0  -1.143
2  100  200   0.03205  203.0   0.000
3  100  201   0.29670  191.0  -4.434
4  100  201  -0.48930  196.1  -4.691
5  100  202   0.30840  183.4  -4.134
6  100  202  -0.48730  188.2  -4.750
7  100  202  -0.24830  188.4  -1.106
8  100  203   0.30690  177.1  -4.355
9  101  203  -0.49560  182.5  -5.217

In [53]: df['pos'] = group_position(df['Z'], df['A']) 

In [54]: df.set_index(['Z', 'A', 'pos']).unstack('pos') 
Out[54]: 
         beta2                       M                    shell
pos            0        1         2      0      1      2       0       1       2
Z   A
100 200   0.3112  -0.4197   0.03205  197.2  202.0  203.0  -4.213  -1.143   0.000
    201   0.2967  -0.4893       NaN  191.0  196.1    NaN  -4.434  -4.691     NaN
    202   0.3084  -0.4873  -0.24830  183.4  188.2  188.4  -4.134  -4.750  -1.106
    203   0.3069      NaN       NaN  177.1    NaN    NaN  -4.355     NaN     NaN
101 203  -0.4956      NaN       NaN  182.5    NaN    NaN  -5.217     NaN     NaN

Finally, a few adjustments give exactly the layout you showed:

In [61]: result = df.set_index(['Z', 'A', 'pos']).unstack('pos') 

In [62]: result.rename(columns=lambda x: '%s[%d]' % (x[0], x[1]+1)).reset_index() 
Out[62]: 
     Z    A  beta2[1]  beta2[2]  beta2[3]   M[1]   M[2]   M[3]  shell[1]  shell[2]  shell[3]
0  100  200    0.3112   -0.4197   0.03205  197.2  202.0  203.0    -4.213    -1.143     0.000
1  100  201    0.2967   -0.4893       NaN  191.0  196.1    NaN    -4.434    -4.691       NaN
2  100  202    0.3084   -0.4873  -0.24830  183.4  188.2  188.4    -4.134    -4.750    -1.106
3  100  203    0.3069       NaN       NaN  177.1    NaN    NaN    -4.355       NaN       NaN
4  101  203   -0.4956       NaN       NaN  182.5    NaN    NaN    -5.217       NaN       NaN
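
On recent pandas versions, `rename` with a function may be applied to each level label of the column MultiIndex rather than to the whole `(name, pos)` tuple, in which case the lambda above would not work unchanged. The sketch below runs the same pipeline end to end under that assumption: it assigns the flattened labels to `result.columns` directly and uses `groupby(...).cumcount()` (an assumption about the installed version) in place of `group_position`.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Z": [100] * 9 + [101],
    "A": [200, 200, 200, 201, 201, 202, 202, 202, 203, 203],
    "beta2": [0.3112, -0.4197, 0.03205, 0.2967, -0.4893,
              0.3084, -0.4873, -0.2483, 0.3069, -0.4956],
    "M": [197.2, 202.0, 203.0, 191.0, 196.1,
          183.4, 188.2, 188.4, 177.1, 182.5],
    "shell": [-4.213, -1.143, 0.0, -4.434, -4.691,
              -4.134, -4.75, -1.106, -4.355, -5.217],
})

# Position of each row within its (Z, A) group makes the index unique.
df["pos"] = df.groupby(["Z", "A"]).cumcount()
result = df.set_index(["Z", "A", "pos"]).unstack("pos")

# Flatten the two-level columns, e.g. ('beta2', 0) -> 'beta2[1]'.
result.columns = ["%s[%d]" % (name, pos + 1) for name, pos in result.columns]
result = result.reset_index()
print(result)
```

Missing positions come out as NaN, as in the output above.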