2015-04-20

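Resample a pandas DataFrame with coefficients

I have a DataFrame with the following columns: {'day', 'measurement'}

There may be several measurements on a single day (or none at all).

For example: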

day  | measurement 
1  |  20.1 
1  |  20.9 
3  |  19.2 
4  |  20.0 
4  |  20.2 

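and a set of coefficients keyed by day offset: coef={-1:0.2, 0:0.6, 1:0.2}

My goal is to resample the data and compute a weighted average using the coefficients (missing data should be omitted, with the remaining weights renormalized).

This is the code I wrote to compute it: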

window = [-1, 0, 1]
resampled = {}
for d in range(df['day'].min(), df['day'].max() + 1):
    present = [i for i in window if (df['day'] == d - i).any()]  # offsets that have data
    total = sum(coef[i] * df.loc[df['day'] == d - i, 'measurement'].mean() for i in present)
    resampled[d] = total / sum(coef[i] for i in present)  # renormalize over available days

For the example above, the output should be:

Day measurement 
1 20.500 
2 19.850 
3 19.425 
4 19.875 
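
To make the arithmetic explicit, here is a small sketch that reproduces these values from the per-day means (20.5, 19.2 and 20.1 for days 1, 3 and 4). The names `day_mean`, `weighted` and `expected` are made up for this check, and it assumes that the weight for offset i applies to day d - i, as in the code above.

```python
coef = {-1: 0.2, 0: 0.6, 1: 0.2}
# Per-day means taken from the example table; day 2 has no measurements
day_mean = {1: 20.5, 3: 19.2, 4: 20.1}


def weighted(d):
    """Weighted mean around day d, skipping offsets whose day has no data."""
    present = [i for i in coef if d - i in day_mean]
    total = sum(coef[i] * day_mean[d - i] for i in present)
    # Divide by the weights actually used so they sum to 1 again
    return total / sum(coef[i] for i in present)


expected = {d: round(weighted(d), 3) for d in range(1, 5)}
print(expected)
```

For day 3, for instance, only days 3 and 4 contribute: (0.6 * 19.2 + 0.2 * 20.1) / 0.8 = 19.425.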

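The problem is that the code takes forever to run, and I am fairly sure there is a better way to resample with coefficients.

Any advice would be greatly appreciated!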

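Could you help me understand how the coefficients translate into the expected output above? My understanding is that, for day 4 for example, you would expect '(0.2 * 19.2 + 0.6 * 20.1) / 0.8', which is '19.875', not '19.97'. It would help if you could walk through the calculation for day 4 or day 3.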

My mistake, thanks @SAnand

@UriGoren Are the expected results for days 2 and 3 accurate? I think you should update those as well! – Zero

Answer

Here is one possible solution for what you are looking for:

In [1]: import numpy as np
   ...: import pandas as pd

     # This is your data
In [2]: data = pd.DataFrame({
   ...:     'day': [1, 1, 3, 4, 4],
   ...:     'measurement': [20.1, 20.9, 19.2, 20.0, 20.2]
   ...: })

     # Pre-compute every day's average, filling the gaps
In [3]: measurement = data.groupby('day')['measurement'].mean()

In [4]: measurement = measurement.reindex(np.arange(data.day.min(), data.day.max() + 1))

In [5]: coef = pd.Series({-1: 0.2, 0: 0.6, 1: 0.2}) 

     # Create a matrix with the time-shifted measurements 
In [6]: matrix = pd.DataFrame({key: measurement.shift(key) for key in coef.index})

In [7]: matrix 
Out[7]: 
       -1     0     1
day
1     NaN  20.5   NaN
2    19.2   NaN  20.5
3    20.1  19.2   NaN
4     NaN  20.1  19.2

     # Take a weighted average of the matrix 
In [8]: (matrix * coef).sum(axis=1)/(matrix.notnull() * coef).sum(axis=1) 
Out[8]: 
day
1    20.500
2    19.850
3    19.425
4    19.875
dtype: float64
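
If you need this in more than one place, the steps above could be wrapped in a function along these lines. This is only a sketch: the name `resample_with_coef` is my own, it assumes the columns are called 'day' and 'measurement' and that the days are integers, and it has only been checked against the example data.

```python
import numpy as np
import pandas as pd


def resample_with_coef(df, coef):
    """Weighted average of per-day means over the offsets in `coef`, ignoring missing days."""
    weights = pd.Series(coef)
    # Mean per day, reindexed so that days without measurements show up as NaN
    daily = df.groupby('day')['measurement'].mean()
    daily = daily.reindex(np.arange(df['day'].min(), df['day'].max() + 1))
    # One column per offset: shift(k) places the mean of day d - k on row d
    shifted = pd.DataFrame({k: daily.shift(k) for k in weights.index})
    # Divide by the sum of the weights whose offsets actually have data
    return (shifted * weights).sum(axis=1) / (shifted.notna() * weights).sum(axis=1)


data = pd.DataFrame({
    'day': [1, 1, 3, 4, 4],
    'measurement': [20.1, 20.9, 19.2, 20.0, 20.2],
})
result = resample_with_coef(data, {-1: 0.2, 0: 0.6, 1: 0.2})
print(result)
```

Note that a day with no data anywhere in its window ends up as 0/0, so it comes out as NaN rather than raising an error.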