Elegant groupby and update in pandas?

I have the following pandas.DataFrame object:

 offset      ts    op time 
0 0.000000 2015-10-27 18:31:40.318  Decompress 2.953 
1 0.000000 2015-10-27 18:31:40.318 DeserializeBond 0.015 
32 0.000000 2015-10-27 18:31:40.318   Compress 17.135 
33 0.000000 2015-10-27 18:31:40.318  BuildIndex 19.494 
34 0.000000 2015-10-27 18:31:40.318  InsertIndex 0.625 
35 0.000000 2015-10-27 18:31:40.318   Compress 16.970 
36 0.000000 2015-10-27 18:31:40.318  BuildIndex 18.954 
37 0.000000 2015-10-27 18:31:40.318  InsertIndex 0.047 
38 0.000000 2015-10-27 18:31:40.318   Compress 16.017 
39 0.000000 2015-10-27 18:31:40.318  BuildIndex 17.814 
40 0.000000 2015-10-27 18:31:40.318  InsertIndex 0.047 
77 4.960683 2015-10-27 18:36:37.959  Decompress 2.844 
78 4.960683 2015-10-27 18:36:37.959 DeserializeBond 0.000 
108 4.960683 2015-10-27 18:36:37.959   Compress 17.758 
109 4.960683 2015-10-27 18:36:37.959  BuildIndex 19.742 
110 4.960683 2015-10-27 18:36:37.959  InsertIndex 0.110 
111 4.960683 2015-10-27 18:36:37.959   Compress 16.267 
112 4.960683 2015-10-27 18:36:37.959  BuildIndex 18.111 
113 4.960683 2015-10-27 18:36:37.959  InsertIndex 0.062

I want to group by the (offset, ts, op) fields and sum the time values:

df = df.groupby(['offset', 'ts', 'op']).sum()

So far so good:

            time 
offset ts      op      
0.000000 2015-10-27 18:31:40.318 BuildIndex  56.262 
           Compress   50.122 
           Decompress  2.953 
           DeserializeBond 0.015 
           InsertIndex  0.719 
4.960683 2015-10-27 18:36:37.959 BuildIndex  37.853 
           Compress   34.025 
           Decompress  2.844 
           DeserializeBond 0.000 
           InsertIndex  0.172

The problem is that I have to subtract Compress from BuildIndex within each group. I was recommended to use DataFrame.xs(), and I came up with the following:

diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op") 
diff['op'] = 'BuildIndex' 
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val) 
df.update(diff)

It works, but I have a strong feeling that there must be a more elegant way to solve this.

Can anyone suggest a better way to do it?

Source

2015-11-09 Sergiy Matusevych

Note: your line:

diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)

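is unnecessary, because diff is unchanged by it (the rows are already unique thanks to the earlier grouping).

A small hack is to use drop_level=False together with .values (so that the index is ignored during the subtraction). This is a little cheeky, because it assumes that every group has both a "BuildIndex" and a "Compress" row, which may not be safe.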

In [11]: diff = df1.xs("BuildIndex", level="op", drop_level=False) - df1.xs("Compress", level="op").values 

In [12]: diff 
Out[12]: 
            time 
offset  ts   op 
2015-10-27 18:31:40.318 BuildIndex 6.140 
      18:36:37.959 BuildIndex 3.828
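
If the assumption that every group has both rows is a concern, one possible alternative is to let pandas align the two cross-sections by index label instead of stripping the index with .values. The following is only a sketch: the sample data is made up, it groups by (offset, op) only to stay short, and treating a missing "Compress" row as zero is an assumption you may not want.

```python
import pandas as pd

# Hypothetical sample data; the second group deliberately has no "Compress" row.
raw = pd.DataFrame({
    "offset": [0.0, 0.0, 0.0, 1.0, 1.0],
    "op": ["BuildIndex", "Compress", "Decompress", "BuildIndex", "Decompress"],
    "time": [56.0, 50.0, 3.0, 38.0, 2.0],
})
df1 = raw.groupby(["offset", "op"]).sum()

# Both cross-sections are indexed by "offset" only.
build = df1.xs("BuildIndex", level="op")
compress = df1.xs("Compress", level="op")

# Align "Compress" to the groups that have a "BuildIndex" row.
# Assumption: a group without "Compress" subtracts 0.
diff = build - compress.reindex(build.index, fill_value=0.0)

# Put the "op" level back so the index matches df1, then update in place.
diff["op"] = "BuildIndex"
diff = diff.set_index("op", append=True)
df1.update(diff)

print(df1)
```

Because the subtraction is aligned on labels, a group with a missing row cannot silently shift values into the wrong group, which is the risk with the .values version.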

I would be tempted to unstack here, since the data is really two-dimensional:

In [21]: res = df1.unstack("op") 

In [22]: res 
Out[22]: 
           time 
op      BuildIndex Compress Decompress DeserializeBond InsertIndex 
offset  ts 
2015-10-27 18:31:40.318  56.262 50.122  2.953   0.015  0.719 
      18:36:37.959  37.853 34.025  2.844   0.000  0.172

It is not clear that there is any value in the columns being a MultiIndex here, so you can flatten them:

In [23]: res.columns = res.columns.get_level_values(1) 

In [24]: res 
Out[24]: 
op      BuildIndex Compress Decompress DeserializeBond InsertIndex 
offset  ts 
2015-10-27 18:31:40.318  56.262 50.122  2.953   0.015  0.719 
      18:36:37.959  37.853 34.025  2.844   0.000  0.172

Then the subtraction is much easier:

In [25]: res["BuildIndex"] - res["Compress"] 
Out[25]: 
offset  ts 
2015-10-27 18:31:40.318 6.140 
      18:36:37.959 3.828 
dtype: float64 

In [26]: res["BuildIndex"] = res["BuildIndex"] - res["Compress"]
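
Putting the steps above together, a minimal end-to-end sketch could look like this. The data is made up and grouped by (offset, op) only for brevity. Selecting the "time" column before summing is my own shortcut, not part of the steps above: it gives a Series, so unstack produces flat columns and the get_level_values step is not needed.

```python
import pandas as pd

# Hypothetical sample data with repeated ops per group.
raw = pd.DataFrame({
    "offset": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    "op": ["Compress", "BuildIndex", "Compress", "BuildIndex",
           "Decompress", "Compress", "BuildIndex", "Decompress"],
    "time": [17.0, 19.0, 16.0, 18.0, 3.0, 17.0, 20.0, 2.0],
})

# Sum per (offset, op), then pivot "op" into columns: one row per group.
res = raw.groupby(["offset", "op"])["time"].sum().unstack("op")

# With one column per op, the per-group subtraction is a plain column operation.
res["BuildIndex"] = res["BuildIndex"] - res["Compress"]

print(res)
```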

I suspect this is the most elegant way...

Source

2015-11-10 03:48:21

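This is great! Thank you very much for your help. It turns out that you can address multi-level columns as tuples, so after unstacking you can simply write: `df['time', 'BuildIndex'] -= df['time', 'Compress']`. Now I am happy :-)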
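
For reference, the tuple-based variant from the comment above might look like the sketch below. The data is made up and grouped by (offset, op) only; the key point is that after unstack the columns are a two-level MultiIndex and can be addressed as ('time', op) tuples.

```python
import pandas as pd

# Hypothetical sample data.
raw = pd.DataFrame({
    "offset": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    "op": ["Compress", "BuildIndex", "Decompress",
           "Compress", "BuildIndex", "Decompress"],
    "time": [33.0, 37.0, 3.0, 17.0, 20.0, 2.0],
})

# Summing the whole frame keeps "time" as a column level after unstack.
df = raw.groupby(["offset", "op"]).sum().unstack("op")

# Columns are tuples such as ("time", "BuildIndex").
df["time", "BuildIndex"] -= df["time", "Compress"]

print(df)
```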
