2014-06-24 103 views
1

Multi-index pandas groupby, ignoring a level?

I am running a groupby operation on a MultiIndexed DataFrame similar to this one:

                                        0         1  ...
categories features subfeatures
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895  ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235  ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721  ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417

...        ...      ...

It can be generated with the following code:

import numpy as np
import pandas as pd

np.random.seed(seed=90)
results = np.random.randn(3, 2, 2, 2)
results[2, 0, 0, :] = np.nan  # cat3 / feature1 / subfeature1: both values missing
results[1, 0, 0, 1] = np.nan  # cat2 / feature1 / subfeature1: second value missing
results = results.reshape((-1, 2))  # 12 rows x 2 columns
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"],
                                    ["subfeature1", "subfeature2"]],
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)

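I am trying to select only the groups where the maximum difference between the two subfeature arrays is greater than some threshold, but I am running into trouble with groupby: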

df.groupby(level=['categories','features']) 

This gives me the following groups:

{('cat1', 'feature1'): [('cat1', 'feature1', 'subfeature1'), 
    ('cat1', 'feature1', 'subfeature2')], 
('cat1', 'feature2'): [('cat1', 'feature2', 'subfeature1'), 
    ('cat1', 'feature2', 'subfeature2')], 
('cat2', 'feature1'): [('cat2', 'feature1', 'subfeature1'), 
    ('cat2', 'feature1', 'subfeature2')], 
('cat2', 'feature2'): [('cat2', 'feature2', 'subfeature1'), 
    ('cat2', 'feature2', 'subfeature2')], 
('cat3', 'feature1'): [('cat3', 'feature1', 'subfeature1'), 
    ('cat3', 'feature1', 'subfeature2')], 
('cat3', 'feature2'): [('cat3', 'feature2', 'subfeature1'), 
    ('cat3', 'feature2', 'subfeature2')]} 

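Is there any way to group so that the subfeatures level is ignored by the groupby function? The reason is that I need subfeature1 and subfeature2 together; in separate groups they are worthless.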

So ideally I would like groupby to return something like this:

{('cat1', 'feature1'): [('cat1', 'feature1')], 
('cat1', 'feature2'): [('cat1', 'feature2')], 
('cat2', 'feature1'): [('cat2', 'feature1')], 
('cat2', 'feature2'): [('cat2', 'feature2')], 
('cat3', 'feature1'): [('cat3', 'feature1')], 
('cat3', 'feature2'): [('cat3', 'feature2')]}

How can I do this?

Answers

1
In [20]: df.reset_index(level='subfeatures').groupby(level=['categories','features']).groups 
Out[20]: 
{('cat1', 'feature1'): [('cat1', 'feature1'), ('cat1', 'feature1')], 
('cat1', 'feature2'): [('cat1', 'feature2'), ('cat1', 'feature2')], 
('cat2', 'feature1'): [('cat2', 'feature1'), ('cat2', 'feature1')], 
('cat2', 'feature2'): [('cat2', 'feature2'), ('cat2', 'feature2')], 
('cat3', 'feature1'): [('cat3', 'feature1'), ('cat3', 'feature1')], 
('cat3', 'feature2'): [('cat3', 'feature2'), ('cat3', 'feature2')]} 
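
As a quick check on the output above, here is a minimal sketch that counts the rows in each group. It rebuilds the `df` from the question so it runs on its own; the names `grouped` and `group_sizes` are only illustrative. Each (categories, features) key should hold two rows, one per subfeature, which is why every key appears twice in `.groups`.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
np.random.seed(seed=90)
results = np.random.randn(3, 2, 2, 2)
results[2, 0, 0, :] = np.nan
results[1, 0, 0, 1] = np.nan
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(results.reshape((-1, 2)), index=index)

# Move 'subfeatures' out of the index, then group on the two remaining levels.
grouped = df.reset_index(level="subfeatures").groupby(level=["categories", "features"])

# size() counts rows per group, including rows that contain NaN.
group_sizes = grouped.size()
print(group_sizes)
```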
+0

Could there be duplicates among the values? For example, `('cat1', 'feature1')` is included twice in the values list. – tlnagy

+0

What are you trying to do? You almost never need to use `.groups` directly. They are not dupes; each group has two rows. – Jeff

+0

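I am comparing the arrays of numbers to the right of the subfeatures column. I want to compare the `subfeature1` array with the `subfeature2` array for each (cat, feature) group. – tlnagy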

0

With Jeff's help, I managed to find a working solution.

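def f(x):
    # x holds the two rows of one (categories, features) group
    tmp = x.set_index('subfeatures')
    # largest absolute difference between the two subfeature rows
    a = (tmp.xs('subfeature1') - tmp.xs('subfeature2')).abs().max()
    return a > 1  # keep the group only if the difference exceeds 1

df.reset_index('subfeatures').groupby(level=['categories', 'features']).filter(f).set_index('subfeatures', append=True)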

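I basically ignore subfeatures for the grouping, then temporarily add it back as the index inside the filter function. That index is lost in the filtered result, so I add it back once the filter has finished.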
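
A possible variation, offered only as a sketch: grouping on `['categories', 'features']` already keeps both subfeature rows in each group, so the filter can probably be applied without the `reset_index` / `set_index` round trip. The helper name `exceeds_threshold` and its `threshold` parameter are assumptions made for illustration and are not part of the original solution; the frame is rebuilt from the question's code so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
np.random.seed(seed=90)
results = np.random.randn(3, 2, 2, 2)
results[2, 0, 0, :] = np.nan
results[1, 0, 0, 1] = np.nan
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(results.reshape((-1, 2)), index=index)


def exceeds_threshold(group, threshold=1.0):
    # Hypothetical helper; `threshold` is an assumed parameter name.
    # Each group still carries the full three-level index, so the two
    # subfeature rows can be picked out with a cross-section on that level.
    first = group.xs("subfeature1", level="subfeatures")
    second = group.xs("subfeature2", level="subfeatures")
    # max() skips NaN; a group whose differences are all NaN compares False.
    return bool((first - second).abs().max().max() > threshold)


# filter() keeps or drops whole groups, so 'subfeatures' stays in the index.
result = df.groupby(level=["categories", "features"]).filter(exceeds_threshold)
print(result)
```

Because the grouping uses index levels and `filter` returns the original rows of the surviving groups, the result keeps all three index levels and nothing has to be restored afterwards.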