2017-08-30
3

Combining complex aggregation functions with groupby

Consider the following table:

import numpy as np
import pandas as pd

np.random.seed(42)
# One row per minute from 2017-01-01 to 2017-01-15
ix = pd.date_range('2017-01-01', '2017-01-15', freq='60s')
df = pd.DataFrame(
    {
     'val': np.random.random(size=ix.shape[0]),
     'active': np.random.choice([0,1], size=ix.shape[0])
    },
    index=ix
)
df.sample(10)

which produces:

                     active       val
2017-01-02 06:05:00       1  0.774654
2017-01-04 08:15:00       1  0.934796
2017-01-13 01:02:00       0  0.792351
...

My goal is to compute:

  • the sum per day
  • the sum of actives per day

The sum per day is straightforward:

gb = df.groupby(pd.to_datetime(df.index.date)) 
overall_sum_per_day = gb['val'].sum().rename('overall') 

The sum of actives per day is a bit more tricky (see this).

active_sum_per_day = gb.agg(lambda x: x[x.active==1]['val'].sum())['val'].rename('active') 

My question is how I can combine the two. Using concat:

pd.concat([overall_sum_per_day, active_sum_per_day], axis=1) 

I can reach my goal this way. However, I cannot get there in one go, applying both aggregations at once. Is that possible? See this comment.

Check my answer to see how to clean up your groupby and apply function.

Answers

3

You can use GroupBy.apply:

b = gb.apply(lambda x: pd.Series([x['val'].sum(), x.loc[x.active==1, 'val'].sum()],
                                 index=['overall', 'active']))
print(b)
       overall  active 
2017-01-01 715.997165 366.856234 
2017-01-02 720.101832 355.100828 
2017-01-03 711.247370 335.231948 
2017-01-04 713.688122 338.088299 
2017-01-05 716.127970 342.889442 
2017-01-06 697.319129 338.741027 
2017-01-07 708.121948 361.086977 
2017-01-08 731.032093 370.697884 
2017-01-09 718.386679 342.162494 
2017-01-10 709.706473 349.657514 
2017-01-11 720.477342 368.407343 
2017-01-12 738.286682 378.618305 
2017-01-13 735.805583 372.039108 
2017-01-14 727.502271 345.612816 
2017-01-15 0.613559 0.613559 
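
To make the mechanics easier to follow, here is a minimal sketch of the same GroupBy.apply pattern on a tiny hand-made frame. The frame `small`, its values, and the helper name `sums` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Tiny stand-in for the question's df (hypothetical values).
ix = pd.to_datetime(['2017-01-01 00:00', '2017-01-01 00:01',
                     '2017-01-02 00:00', '2017-01-02 00:01'])
small = pd.DataFrame({'val': [1.0, 2.0, 3.0, 4.0],
                      'active': [1, 0, 0, 0]}, index=ix)

def sums(g):
    # g is the sub-DataFrame for one day; return one labelled value per output column.
    return pd.Series({'overall': g['val'].sum(),
                      'active': g.loc[g['active'] == 1, 'val'].sum()})

# Group by the calendar day of the index and build both sums in one pass.
result = small.groupby(small.index.normalize()).apply(sums)
print(result)
```

Each group returns a Series, and apply stacks those into one row per day. A day with no active rows gets 0.0 in the active column, because the sum of an empty selection is 0.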

Another solution:

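# agg spreads the returned list over the columns ('active', 'val') here;
# this depends on the pandas version and other versions may reject it.
b = (gb.agg(lambda x: [x['val'].sum(), x.loc[x.active==1, 'val'].sum()])
       .rename(columns={'active': 'overall', 'val': 'active'}))
print(b)
       overall  active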
2017-01-01 715.997165 366.856234 
2017-01-02 720.101832 355.100828 
2017-01-03 711.247370 335.231948 
2017-01-04 713.688122 338.088299 
2017-01-05 716.127970 342.889442 
2017-01-06 697.319129 338.741027 
2017-01-07 708.121948 361.086977 
2017-01-08 731.032093 370.697884 
2017-01-09 718.386679 342.162494 
2017-01-10 709.706473 349.657514 
2017-01-11 720.477342 368.407343 
2017-01-12 738.286682 378.618305 
2017-01-13 735.805583 372.039108 
2017-01-14 727.502271 345.612816 
2017-01-15 0.613559 0.613559 
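
For readers on pandas 0.25 or later, which postdates this thread, named aggregation is another way to get both columns from a single agg call. The sketch below is not the answerer's method; it assumes the question's df and introduces a hypothetical helper column `active_val` that holds val for active rows and 0 otherwise.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame.
np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='60s')
df = pd.DataFrame({'val': np.random.random(size=ix.shape[0]),
                   'active': np.random.choice([0, 1], size=ix.shape[0])},
                  index=ix)

# active_val is a hypothetical helper column: val where active == 1, else 0.
b = (df.assign(active_val=df['val'] * df['active'])
       .groupby(df.index.normalize())
       .agg(overall=('val', 'sum'), active=('active_val', 'sum')))
print(b)
```

Masking first keeps the aggregation itself to plain built-in sums, so no Python lambda runs per group.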
3

IIUC, we can do it in one step, working with your original DF:

In [105]: df.groupby([df.index.normalize(), 'active'])['val'] \ 
    ...: .sum() \ 
    ...: .unstack(fill_value=0) \ 
    ...: .rename(columns={0:'overall', 1:'active'}) \ 
    ...: .assign(overall=lambda x: x['overall'] + x['active']) 
Out[105]: 
active   overall  active 
2017-01-01 715.997165 366.856234 
2017-01-02 720.101832 355.100828 
2017-01-03 711.247370 335.231948 
2017-01-04 713.688122 338.088299 
2017-01-05 716.127970 342.889442 
...    ...   ... 
2017-01-11 720.477342 368.407343 
2017-01-12 738.286682 378.618305 
2017-01-13 735.805583 372.039108 
2017-01-14 727.502271 345.612816 
2017-01-15 0.613559 0.613559 

[15 rows x 2 columns] 

Explanation:

In [64]: df.groupby([df.index.normalize(), 'active'])['val'].sum() 
Out[64]: 
      active 
2017-01-01 0   349.140931 
      1   366.856234 
2017-01-02 0   365.001004 
      1   355.100828 
2017-01-03 0   376.015422 
         ... 
2017-01-13 0   363.766475 
      1   372.039108 
2017-01-14 0   381.889455 
      1   345.612816 
2017-01-15 1   0.613559 
Name: val, Length: 29, dtype: float64 

In [65]: df.groupby([df.index.normalize(), 'active'])['val'].sum().unstack(fill_value=0) 
Out[65]: 
active    0   1 
2017-01-01 349.140931 366.856234 
2017-01-02 365.001004 355.100828 
2017-01-03 376.015422 335.231948 
2017-01-04 375.599823 338.088299 
2017-01-05 373.238528 342.889442 
...    ...   ... 
2017-01-11 352.069999 368.407343 
2017-01-12 359.668377 378.618305 
2017-01-13 363.766475 372.039108 
2017-01-14 381.889455 345.612816 
2017-01-15 0.000000 0.613559 

[15 rows x 2 columns] 
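
The unstack, rename, and assign steps can also be traced on a three-row Series. This is only an illustrative sketch: the labels 'd1' and 'd2' and the values are made up, standing in for the per-day, per-flag sums shown above.

```python
import pandas as pd

# Stand-in for the grouped sums: (day, active flag) -> sum of val.
# 'd2' has no inactive rows, like 2017-01-15 in the real data.
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.MultiIndex.from_tuples(
                  [('d1', 0), ('d1', 1), ('d2', 1)],
                  names=['day', 'active']))

# Move the 'active' level into columns 0 and 1; missing combinations become 0.
wide = s.unstack(fill_value=0)

# Column 0 holds the inactive sum, so adding column 1 turns it into the overall sum.
out = (wide.rename(columns={0: 'overall', 1: 'active'})
           .assign(overall=lambda x: x['overall'] + x['active']))
print(out)
```

Without fill_value=0, the missing ('d2', 0) cell would be NaN and the overall sum for that day would be NaN as well.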
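You should use .assign with a lambda rather than eval, which is a bit magical – Jeff

@Jeff, OK, thanks for your comment! I'll change it as soon as I'm back at my notebook (writing from my phone) – MaxU

@Jeff, using 'assign' - how would I access the dynamically created column? With a lambda? – MaxU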

1

I think it is cleaner to use pd.Grouper, which is built for grouping by datetime. You can also define a clear function.

def func(df): 
    active = (df['active'] * df['val']).sum() 
    overall = df['val'].sum() 
    return pd.Series(data=[active, overall], index=['active','overall']) 

df.groupby(pd.Grouper(freq='D')).apply(func)

       active  overall 
2017-01-01 366.856234 715.997165 
2017-01-02 355.100828 720.101832 
2017-01-03 335.231948 711.247370 
2017-01-04 338.088299 713.688122 
2017-01-05 342.889442 716.127970 
2017-01-06 338.741027 697.319129 
2017-01-07 361.086977 708.121948 
2017-01-08 370.697884 731.032093 
2017-01-09 342.162494 718.386679 
2017-01-10 349.657514 709.706473 
2017-01-11 368.407343 720.477342 
2017-01-12 378.618305 738.286682 
2017-01-13 372.039108 735.805583 
2017-01-14 345.612816 727.502271 
2017-01-15 0.613559 0.613559 

You should be able to do this with resample and apply, but there is a bug.

df.resample('D').apply(func) # should work but doesn't produce correct output

       active val 
2017-01-01 366.856234 NaN 
2017-01-02 355.100828 NaN 
2017-01-03 335.231948 NaN 
2017-01-04 338.088299 NaN 
2017-01-05 342.889442 NaN 
2017-01-06 338.741027 NaN 
2017-01-07 361.086977 NaN 
2017-01-08 370.697884 NaN 
2017-01-09 342.162494 NaN 
2017-01-10 349.657514 NaN 
2017-01-11 368.407343 NaN 
2017-01-12 378.618305 NaN 
2017-01-13 372.039108 NaN 
2017-01-14 345.612816 NaN 
2017-01-15 0.613559 NaN
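
One way to stay with resample while avoiding apply altogether is to mask the values first and then take a plain sum. This is a sketch on a tiny made-up frame (`small` and its values are hypothetical), not a fix for the behaviour shown above.

```python
import pandas as pd

# Tiny stand-in for the question's df (hypothetical values).
ix = pd.to_datetime(['2017-01-01 00:00', '2017-01-01 12:00',
                     '2017-01-02 00:00', '2017-01-02 12:00'])
small = pd.DataFrame({'val': [1.0, 2.0, 3.0, 4.0],
                      'active': [1, 0, 1, 1]}, index=ix)

# Overwrite 'active' with val where the flag is 1 (0 otherwise), rename 'val'
# to 'overall', then let resample do an ordinary per-day sum of both columns.
daily = (small.assign(active=small['val'] * small['active'])
              .rename(columns={'val': 'overall'})
              .resample('D').sum())
print(daily)
```

Because only the built-in sum is used, the result does not depend on how resample handles a user function that returns a Series.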