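Speeding up rolling mean/std calculations on a grouped pandas DataFrame

I have a DataFrame with three columns: a group, a time, and a value. I want to compute the rolling mean, standard deviation, and so on within each group. Right now I define a function and use apply. However, this is very slow for very large datasets. Here is the function.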

def GetRollingMetrics(x, cols, windows, suffix):
    for col in cols:
        for win in windows:
            x[col + '_' + str(win) + 'D' + '_mean' + '_' + suffix] = x.shift(1).rolling(win)[col].mean()
            x[col + '_' + str(win) + 'D' + '_std' + '_' + suffix] = x.shift(1).rolling(win)[col].std()
            x[col + '_' + str(win) + 'D' + '_min' + '_' + suffix] = x.shift(1).rolling(win)[col].min()
            x[col + '_' + str(win) + 'D' + '_max' + '_' + suffix] = x.shift(1).rolling(win)[col].max()

    return x

Then, to apply it, as an example I use:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 3)), columns=['Group', 'Time', 'Value'])
df.sort_values(by='Time', inplace=True)
df = df.groupby('Group').apply(lambda x: GetRollingMetrics(x, ['Value'], [7, 14, 28], 'my_suffix'))

Is there a more "Pandaic" or efficient way to do this?

Source

2017-08-30 user1566200

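"Pandaic"... :-) Also, do you want to compute these rolling stats for every column and every window? –

Well, in this example I only have one 'Value' column, but I may want to compute this for multiple columns and multiple window sizes, which is why cols is a list. – user1566200

And 'Pandaic' does sound better - edited :) – user1566200

I'm not sure about the speed, but you can certainly use pd.concat instead of df.apply here. Also, you can compute the rolling statistics for all columns at once. You don't have to do one column at a time.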

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 3)),
                  columns=['Group', 'Time', 'Value'])
df.sort_values(by='Time', inplace=True)

suffix = 'my_suffix'
windows = [7, 14, 28]
grouped = df.groupby('Group')

# For each statistic, build one block of renamed columns per window
# and concatenate the blocks side by side.
d1 = pd.concat([grouped.rolling(w).mean()
                .rename(columns=lambda x: x + '_' + str(w) + 'D_mean_' + suffix)
                for w in windows], axis=1)
d2 = pd.concat([grouped.rolling(w).std()
                .rename(columns=lambda x: x + '_' + str(w) + 'D_std_' + suffix)
                for w in windows], axis=1)
d3 = pd.concat([grouped.rolling(w).min()
                .rename(columns=lambda x: x + '_' + str(w) + 'D_min_' + suffix)
                for w in windows], axis=1)
d4 = pd.concat([grouped.rolling(w).max()
                .rename(columns=lambda x: x + '_' + str(w) + 'D_max_' + suffix)
                for w in windows], axis=1)

out = pd.concat([d1, d2, d3, d4], axis=1)

Performance

1 loop, best of 3: 9.9 s per loop
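
One caveat: the snippet above does not apply the shift(1) from the question, and its result is indexed by (Group, original index) rather than by the original index alone. The following is a minimal sketch of one way to reproduce the shift and attach the statistics back to the original rows. The helper name add_group_rolling and its parameters are hypothetical, and the sketch assumes the DataFrame has a unique index.

```python
import numpy as np
import pandas as pd


def add_group_rolling(df, group, cols, windows, suffix,
                      stats=('mean', 'std', 'min', 'max')):
    """Hypothetical helper: per-group rolling stats that exclude the current row.

    Assumes df has a unique index so results can be realigned to it.
    """
    # Shift within each group so a row never contributes to its own window.
    shifted = df.groupby(group)[cols].shift(1)
    keys = df[group]
    pieces = [df]
    for win in windows:
        roller = shifted.groupby(keys).rolling(win)
        for stat in stats:
            # Result is indexed by (group, original index); drop the group level
            # and restore the original row order.
            res = getattr(roller, stat)().droplevel(0).reindex(df.index)
            res.columns = ['{}_{}D_{}_{}'.format(c, win, stat, suffix)
                           for c in res.columns]
            pieces.append(res)
    return pd.concat(pieces, axis=1)


example = pd.DataFrame({'Group': [0, 1, 0, 1, 0, 1],
                        'Time': [1, 2, 3, 4, 5, 6],
                        'Value': [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]})
result = add_group_rolling(example, 'Group', ['Value'], [2], 'my_suffix')
```

Whether this is faster than the apply-based version depends on the data and the pandas version, so it is worth timing both on a realistic sample.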

Source

2017-08-30 13:22:50

I refactored your function to use agg(), so that all the statistics for each window are computed in one go:

def GetRollingMetrics(x, cols, windows, suffix):
    for win in windows:
        aggs = {col: ['mean', 'std', 'min', 'max'] for col in cols}
        df = x.shift(1).rolling(win).agg(aggs)
        # the real work is done, just copy the columns into x
        for col in cols:
            prefix = col + '_' + str(win) + 'D'
            for stat in ('mean', 'std', 'min', 'max'):
                x['_'.join((prefix, stat, suffix))] = df[col][stat]
    return x

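It is faster if you have multiple columns. If you only have one column, it does not seem to be any faster. There is definitely room for improvement in the for stat loop: the copying takes about half the time. Maybe you could rename the columns instead, and perhaps concatenate the results afterwards?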
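
The rename-and-concatenate idea could look roughly like the sketch below. It is untimed, and the function name rolling_metrics_concat is hypothetical. It flattens the (column, statistic) column labels that agg() returns and concatenates once per group instead of copying column by column.

```python
import numpy as np
import pandas as pd

STATS = ['mean', 'std', 'min', 'max']


def rolling_metrics_concat(x, cols, windows, suffix):
    """Hypothetical variant of GetRollingMetrics using rename + concat."""
    # Shift once so the current row is excluded from every window.
    shifted = x[cols].shift(1)
    pieces = [x]
    for win in windows:
        stats = shifted.rolling(win).agg(STATS)
        # agg() with a list returns (column, stat) labels; flatten them into
        # names such as Value_7D_mean_my_suffix.
        stats.columns = ['{}_{}D_{}_{}'.format(col, win, stat, suffix)
                         for col, stat in stats.columns]
        pieces.append(stats)
    # A single concat replaces the per-column assignments.
    return pd.concat(pieces, axis=1)


sample = pd.DataFrame({'Value': [1.0, 2.0, 3.0, 4.0, 5.0]})
flat = rolling_metrics_concat(sample, ['Value'], [2], 'my_suffix')
```

It is meant to be called once per group, in the same way as the original function, for example through df.groupby('Group', group_keys=False).apply(...).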

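If you are desperate to speed this up, you should consider Numba, which lets you implement min/max/sum in a single pass that you can then use for all the rolling computations. I have done this before, and it is possible to complete all four computations in less time than it takes now (because the expensive part is loading the data into cache).

Source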
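
As an illustration only, and not the implementation referred to above, here is a minimal sketch of a Numba kernel that computes all four statistics for each window in one function, so each window is read while it is still in cache. The function name rolling_stats is hypothetical, the input is assumed to be a 1-D float array for a single group, and the sketch falls back to plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional dependency; assumed installed for the speed-up
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs (slowly) without Numba.
        return func


@njit
def rolling_stats(values, win):
    """Return an (n, 4) array: rolling mean, std (ddof=1), min, max."""
    n = values.shape[0]
    out = np.full((n, 4), np.nan)
    for i in range(win - 1, n):
        total = 0.0
        lo = np.inf
        hi = -np.inf
        has_nan = False
        # First sweep over the window: sum, min and max together.
        for j in range(i - win + 1, i + 1):
            v = values[j]
            if np.isnan(v):
                has_nan = True
                break
            total += v
            if v < lo:
                lo = v
            if v > hi:
                hi = v
        if has_nan:
            continue  # leave NaN, like pandas with a full-window requirement
        mean = total / win
        # Second sweep over the same (cached) window for a stable variance.
        sq = 0.0
        for j in range(i - win + 1, i + 1):
            d = values[j] - mean
            sq += d * d
        out[i, 0] = mean
        if win > 1:
            out[i, 1] = np.sqrt(sq / (win - 1))
        out[i, 2] = lo
        out[i, 3] = hi
    return out


vals = np.array([np.nan, 1.0, 2.0, 3.0, 4.0])  # e.g. a group's values after shift(1)
res = rolling_stats(vals, 2)
```

This version rescans each window, which is reasonable for small windows such as 7, 14 and 28. A true sliding update would be needed for large windows.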

2017-08-30 14:02:05
