2015-11-09 135 views

Elegant groupby and update in pandas?

I have the following pandas.DataFrame object:

 offset      ts    op time 
0 0.000000 2015-10-27 18:31:40.318  Decompress 2.953 
1 0.000000 2015-10-27 18:31:40.318 DeserializeBond 0.015 
32 0.000000 2015-10-27 18:31:40.318   Compress 17.135 
33 0.000000 2015-10-27 18:31:40.318  BuildIndex 19.494 
34 0.000000 2015-10-27 18:31:40.318  InsertIndex 0.625 
35 0.000000 2015-10-27 18:31:40.318   Compress 16.970 
36 0.000000 2015-10-27 18:31:40.318  BuildIndex 18.954 
37 0.000000 2015-10-27 18:31:40.318  InsertIndex 0.047 
38 0.000000 2015-10-27 18:31:40.318   Compress 16.017 
39 0.000000 2015-10-27 18:31:40.318  BuildIndex 17.814 
40 0.000000 2015-10-27 18:31:40.318  InsertIndex 0.047 
77 4.960683 2015-10-27 18:36:37.959  Decompress 2.844 
78 4.960683 2015-10-27 18:36:37.959 DeserializeBond 0.000 
108 4.960683 2015-10-27 18:36:37.959   Compress 17.758 
109 4.960683 2015-10-27 18:36:37.959  BuildIndex 19.742 
110 4.960683 2015-10-27 18:36:37.959  InsertIndex 0.110 
111 4.960683 2015-10-27 18:36:37.959   Compress 16.267 
112 4.960683 2015-10-27 18:36:37.959  BuildIndex 18.111 
113 4.960683 2015-10-27 18:36:37.959  InsertIndex 0.062 

I want to group by the (offset, ts, op) fields and sum the time values:

df = df.groupby(['offset', 'ts', 'op']).sum() 

So far, so good:

                                                   time
offset   ts                      op
0.000000 2015-10-27 18:31:40.318 BuildIndex      56.262
                                 Compress        50.122
                                 Decompress       2.953
                                 DeserializeBond  0.015
                                 InsertIndex      0.719
4.960683 2015-10-27 18:36:37.959 BuildIndex      37.853
                                 Compress        34.025
                                 Decompress       2.844
                                 DeserializeBond  0.000
                                 InsertIndex      0.172
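For anyone who wants to reproduce the grouping step, here is a minimal sketch. It uses only a hand-typed subset of the rows above (the names `rows`, `raw` and `grouped` are mine, not from the question), so the totals differ from the full output.

```python
import pandas as pd

# Hand-typed subset of the question's rows; NOT the full data set.
ts1 = "2015-10-27 18:31:40.318"
ts2 = "2015-10-27 18:36:37.959"
rows = [
    (0.0, ts1, "Decompress", 2.953),
    (0.0, ts1, "Compress", 17.135),
    (0.0, ts1, "BuildIndex", 19.494),
    (0.0, ts1, "Compress", 16.970),
    (0.0, ts1, "BuildIndex", 18.954),
    (4.960683, ts2, "Decompress", 2.844),
    (4.960683, ts2, "Compress", 17.758),
    (4.960683, ts2, "BuildIndex", 19.742),
]
raw = pd.DataFrame(rows, columns=["offset", "ts", "op", "time"])
raw["ts"] = pd.to_datetime(raw["ts"])

# One row per (offset, ts, op), with the time values summed.
grouped = raw.groupby(["offset", "ts", "op"]).sum()
print(grouped)
```

The result has a three-level MultiIndex (offset, ts, op) and a single time column, which is the shape the rest of the thread works with.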

The problem is that I have to subtract Compress from BuildIndex within each group. I was recommended to use DataFrame.xs(), and I came up with the following:

diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op") 
diff['op'] = 'BuildIndex' 
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val) 
df.update(diff) 

It works, but I have a strong feeling that there must be a more elegant way to solve this.

Can someone suggest a better way to do this?

Answer


Note: your line:

diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val) 

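is unnecessary as a groupby, because it leaves the values in diff unchanged (each key is already unique, since diff comes from the earlier grouping). All it really does is move op back into the index.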
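If the only job of that line is to put op back into the index so that update() can align on it, one possible replacement is set_index with append=True. This is a sketch, not something from the original answer; the frame below holds just the BuildIndex and Compress totals copied from the question's output.

```python
import pandas as pd

# BuildIndex/Compress totals copied from the question's grouped output.
ts1 = pd.Timestamp("2015-10-27 18:31:40.318")
ts2 = pd.Timestamp("2015-10-27 18:36:37.959")
idx = pd.MultiIndex.from_tuples(
    [(0.0, ts1, "BuildIndex"), (0.0, ts1, "Compress"),
     (4.960683, ts2, "BuildIndex"), (4.960683, ts2, "Compress")],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame({"time": [56.262, 50.122, 37.853, 34.025]}, index=idx)

# Index-aligned subtraction on (offset, ts), as in the question.
diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op")
diff["op"] = "BuildIndex"

# Append op as a third index level instead of the reset_index/groupby/agg round trip.
diff = diff.set_index("op", append=True)

# update() aligns on the full (offset, ts, op) index and overwrites in place.
df.update(diff)
print(df)
```

Only the BuildIndex rows change; the Compress rows are left as they were.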


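A small hack is to use drop_level=False along with .values (so the index is ignored when subtracting). This is a little cheeky, because it assumes that every group has both a "BuildIndex" row and a "Compress" row, which may not be safe.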

In [11]: diff = df1.xs("BuildIndex", level="op", drop_level=False) - df1.xs("Compress", level="op").values 

In [12]: diff 
Out[12]: 
            time 
offset  ts   op 
2015-10-27 18:31:40.318 BuildIndex 6.140 
      18:36:37.959 BuildIndex 3.828 
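Because the .values trick subtracts by position, a cheap guard is to check that both slices cover the same (offset, ts) groups in the same order before subtracting. The check below is my own addition, sketched on the BuildIndex and Compress totals from the question; `build` and `compress` are illustrative names.

```python
import pandas as pd

# BuildIndex/Compress totals copied from the question's grouped output.
ts1 = pd.Timestamp("2015-10-27 18:31:40.318")
ts2 = pd.Timestamp("2015-10-27 18:36:37.959")
idx = pd.MultiIndex.from_tuples(
    [(0.0, ts1, "BuildIndex"), (0.0, ts1, "Compress"),
     (4.960683, ts2, "BuildIndex"), (4.960683, ts2, "Compress")],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame({"time": [56.262, 50.122, 37.853, 34.025]}, index=idx)

build = df.xs("BuildIndex", level="op", drop_level=False)  # keeps the op level
compress = df.xs("Compress", level="op")                   # drops the op level

# Positional subtraction is only meaningful if the groups line up one to one.
if not build.index.droplevel("op").equals(compress.index):
    raise ValueError("BuildIndex and Compress rows do not line up")

diff = build - compress.values
print(diff)
```

If a group were missing its Compress row, the check would raise instead of silently pairing values from different groups.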

I would be tempted to unstack here, since the data is really two-dimensional:

In [21]: res = df1.unstack("op") 

In [22]: res 
Out[22]: 
           time 
op      BuildIndex Compress Decompress DeserializeBond InsertIndex 
offset  ts 
2015-10-27 18:31:40.318  56.262 50.122  2.953   0.015  0.719 
      18:36:37.959  37.853 34.025  2.844   0.000  0.172 

It's unclear whether there is any value in the columns being a MultiIndex here, so you can flatten them:

In [23]: res.columns = res.columns.get_level_values(1) 

In [24]: res 
Out[24]: 
op      BuildIndex Compress Decompress DeserializeBond InsertIndex 
offset  ts 
2015-10-27 18:31:40.318  56.262 50.122  2.953   0.015  0.719 
      18:36:37.959  37.853 34.025  2.844   0.000  0.172 

Then the subtraction is much easier:

In [25]: res["BuildIndex"] - res["Compress"] 
Out[25]: 
offset  ts 
2015-10-27 18:31:40.318 6.140 
      18:36:37.959 3.828 
dtype: float64 

In [26]: res["BuildIndex"] = res["BuildIndex"] - res["Compress"] 

I suspect this is the most elegant...
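Putting the unstack steps together, a compact variant is to unstack the time column as a Series, which gives flat columns directly, and to stack again if the long (offset, ts, op) layout is needed afterwards. This is a sketch under those assumptions, using the BuildIndex and Compress totals from the question; the stack step is my addition and is not part of the answer above.

```python
import pandas as pd

# BuildIndex/Compress totals copied from the question's grouped output.
ts1 = pd.Timestamp("2015-10-27 18:31:40.318")
ts2 = pd.Timestamp("2015-10-27 18:36:37.959")
idx = pd.MultiIndex.from_tuples(
    [(0.0, ts1, "BuildIndex"), (0.0, ts1, "Compress"),
     (4.960683, ts2, "BuildIndex"), (4.960683, ts2, "Compress")],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame({"time": [56.262, 50.122, 37.853, 34.025]}, index=idx)

# Unstacking the Series gives one flat column per op (no MultiIndex columns).
res = df["time"].unstack("op")
res["BuildIndex"] = res["BuildIndex"] - res["Compress"]

# Optional: return to the long layout with op as the third index level.
long_form = res.stack().to_frame("time")
print(long_form)
```

Unlike the .values approach, this subtraction is aligned on the (offset, ts) index, so a group with a missing op shows up as NaN in the wide frame rather than being paired with the wrong row.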


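This is great! Thank you very much for your help. It turns out that you can address multi-level columns as tuples, so after unstacking you can simply write: df['time', 'BuildIndex'] -= df['time', 'Compress']. Now I'm happy :-)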
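The tuple-style access from this comment can be sketched as follows, again on the BuildIndex and Compress totals from the question. Here `res` is the unstacked frame with its ("time", op) column MultiIndex left in place; the variable name is illustrative.

```python
import pandas as pd

# BuildIndex/Compress totals copied from the question's grouped output.
ts1 = pd.Timestamp("2015-10-27 18:31:40.318")
ts2 = pd.Timestamp("2015-10-27 18:36:37.959")
idx = pd.MultiIndex.from_tuples(
    [(0.0, ts1, "BuildIndex"), (0.0, ts1, "Compress"),
     (4.960683, ts2, "BuildIndex"), (4.960683, ts2, "Compress")],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame({"time": [56.262, 50.122, 37.853, 34.025]}, index=idx)

# Unstacking the DataFrame keeps "time" as the outer column level.
res = df.unstack("op")

# A (outer, inner) tuple selects a single column of the MultiIndex.
res["time", "BuildIndex"] -= res["time", "Compress"]
print(res)
```

This skips the get_level_values step entirely, at the cost of carrying the extra column level around.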