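Pandas: rolling time-weighted average with groupby

2016-11-23

I have the following DataFrame of customer sales history (this is only a sample; the actual DataFrame has more than 70K rows):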

import pandas as pd 
import datetime as DT 

df_test = pd.DataFrame({ 
    'Cus_ID': ["T313","T348","T313","T348","T313","T348","T329","T329","T348","T313","T329","T348"], 
    'Value': [3,2,3,4,5,3,7.25,10.25,4.5,11.75,6.25,6], 
    'Date' : [ 
     DT.datetime(2015,10,18), 
     DT.datetime(2015,11,14), 
     DT.datetime(2015,11,18), 
     DT.datetime(2015,12,13), 
     DT.datetime(2015,12,19), 
     DT.datetime(2016,1,24), 
     DT.datetime(2016,1,31), 
     DT.datetime(2016,2,17), 
     DT.datetime(2016,3,28), 
     DT.datetime(2016,3,31), 
     DT.datetime(2016,4,3),    
     DT.datetime(2016,4,16),    
    ]}) 

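I want to add a new column to the DataFrame that shows, for each row, the time-weighted average of that customer's values over the previous 90 days.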

Expected result (column Value_Result):

    Cus_ID  Date        Value  Value_Result
 0  T313    2015-10-18   3.00           NaN  (no 90-day history)
 1  T348    2015-11-14   2.00           NaN  (no 90-day history)
 2  T313    2015-11-18   3.00             3  (3*31)/31
 3  T348    2015-12-13   4.00             2  (2*29)/29
 4  T313    2015-12-19   5.00             3  (3*62+3*31)/(62+31)
 5  T348    2016-01-24   3.00         2.743  (4*42+2*71)/(42+71)
 6  T329    2016-01-31   7.25           NaN  (no 90-day history)
 7  T329    2016-02-17  10.25          7.25  (7.25*17)/17
 8  T348    2016-03-28   4.50             3  (3*64)/64
 9  T313    2016-03-31  11.75           NaN  (no 90-day history)
10  T329    2016-04-03   6.25         8.516  (10.25*46+7.25*63)/(46+63)
11  T348    2016-04-16   6.00         3.279  (4.5*19+3*83)/(19+83)
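
To make the weighting explicit, here is a small sketch that reproduces the figure for row 5 by hand. It assumes, as the formulas in the table suggest, that each earlier value is weighted by the number of days between it and the current row; the variable names are illustrative only.

```python
import datetime as DT

# Earlier T348 purchases as (date, value), taken from the sample data.
history = [(DT.datetime(2015, 11, 14), 2.0), (DT.datetime(2015, 12, 13), 4.0)]
current = DT.datetime(2016, 1, 24)

# Weight each earlier value by how many days before the current row it occurred.
gaps = [(current - date).days for date, _ in history]
weighted_sum = sum(value * gap for (_, value), gap in zip(history, gaps))
value_result = weighted_sum / sum(gaps)

print(gaps)                    # [71, 42]
print(round(value_result, 3))  # 2.743
```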

I tried using groupby('Cus_ID') with a rolling apply, but I am having trouble writing a function that only looks back 90 days.

Any input is highly appreciated.


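Similar to this question: http://stackoverflow.com/q/15771472/5276797. One option is to resample to daily frequency (that is the accepted answer). If resampling is not an option, another answer provides a custom function to apply. – IanS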

Answer

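I'm not sure the rolling function can handle a weighted average like this, though perhaps someone else knows how to use it that way. I can't guarantee that this is the most optimized approach, but it produces the result you want, and you can take it and build on it if necessary.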

Many thanks to this pbpython article, which I recommend reading in full.

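My approach is to create a function that is applied to each group (grouped by Cus_ID). The function iterates over the rows in the group, computes the weighted average as described above, writes it back to the group, and returns the group. The snippet is verbose for clarity; if needed, you can trim it by removing the intermediate variables.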

The function to apply looks like this:

import numpy as np
import pandas as pd


def tw_avg(group, value_col, time_col, new_col_name="time_weighted_average",
           days_back='-90 days', fill_value=np.nan):
    """
    Calculate the time-weighted (by day) average of the group passed.
    It does not include the day being evaluated, only the previous days_back.
    Intended for use with groupby(...).apply() in pandas.

    Args:
        group (pandas.DataFrame): Will be passed by pandas
        value_col (str): Name of the column with the values to be averaged by weight
        time_col (str): Name of the column with the timestamps
        new_col_name (str): Name of the new column for the time-weighted average, default: time_weighted_average
        days_back (str): Time delta description as described in the pandas time deltas documentation, default: -90 days
        fill_value (any): The value for rows that have no data in the days_back period, default: np.nan

    Returns:
        (pandas.DataFrame): A copy of the group with the time-weighted average added as a column,
        fill_value where no time-weighted average exists
    """
    group = group.copy()  # work on a copy so the frame passed in by pandas is not modified
    group[new_col_name] = fill_value
    window = pd.Timedelta(days_back)  # negative offset, e.g. -90 days

    for idx, row in group.iterrows():
        # Keep only strictly earlier rows that fall inside the look-back window.
        in_window = (group[time_col] < row[time_col]) & (group[time_col] >= row[time_col] + window)
        past = group.loc[in_window]
        if past.empty:
            continue  # no history in the window: leave fill_value in place

        # Dividing by a one-day timedelta turns the gap into a number of days.
        days = (row[time_col] - past[time_col]) / np.timedelta64(1, 'D')
        group.loc[idx, new_col_name] = (past[value_col] * days).sum() / days.sum()

    return group

You can then apply it with this line:

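# group_keys=False keeps the original row index instead of adding Cus_ID as an index level
df_test.groupby(by=['Cus_ID'], group_keys=False).apply(tw_avg, 'Value', 'Date')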

This produces the following DataFrame:

    Cus_ID  Date        Value  time_weighted_average
 0  T313    2015-10-18    3.0  NaN
 1  T348    2015-11-14    2.0  NaN
 2  T313    2015-11-18    3.0  3.0
 3  T348    2015-12-13    4.0  2.0
 4  T313    2015-12-19    5.0  3.0
 5  T348    2016-01-24    3.0  2.743362831858407
 6  T329    2016-01-31   7.25  NaN
 7  T329    2016-02-17  10.25  7.25
 8  T348    2016-03-28    4.5  3.0
 9  T313    2016-03-31  11.75  NaN
10  T329    2016-04-03   6.25  8.51605504587156
11  T348    2016-04-16    6.0  3.2794117647058822

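You can now use tw_avg to compute the weighted average for other value columns via the value_col parameter, or change the length of the time window with the days_back parameter. See the pandas time deltas page for how to describe time offsets.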
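
Since the real data has more than 70K rows, a row-by-row loop may be slow. As a rough sketch of one alternative (not the approach above, and not benchmarked), the same weighting can be expressed as a self-merge on the customer ID followed by a grouped sum. The helper name tw_avg_merge, its days parameter, and the intermediate column names are hypothetical. Note that the merge pairs every row with every other row of the same customer, so memory use grows quickly for customers with long histories.

```python
import pandas as pd


def tw_avg_merge(df, id_col, value_col, time_col, days=90):
    """Hypothetical helper: time-weighted average over the previous `days` days."""
    # One row per current observation, remembering its original index label.
    cur = pd.DataFrame({
        "row": df.index.to_numpy(),
        "id": df[id_col].to_numpy(),
        "t": df[time_col].to_numpy(),
    })
    # One row per candidate earlier observation.
    past = pd.DataFrame({
        "id": df[id_col].to_numpy(),
        "t_past": df[time_col].to_numpy(),
        "v_past": df[value_col].to_numpy(),
    })
    # Pair every row with every row of the same customer.
    pairs = cur.merge(past, on="id")
    # Gap in (possibly fractional) days between the current and the earlier row.
    gap = (pairs["t"] - pairs["t_past"]) / pd.Timedelta(days=1)
    pairs = pairs.assign(gap=gap, weighted=pairs["v_past"] * gap)
    # Keep strictly earlier rows that are at most `days` days back.
    pairs = pairs[(pairs["gap"] > 0) & (pairs["gap"] <= days)]
    sums = pairs.groupby("row")[["weighted", "gap"]].sum()
    # Rows with no qualifying history are absent from `sums` and become NaN.
    return (sums["weighted"] / sums["gap"]).reindex(df.index)


# A few rows from the sample data in the question.
sample = pd.DataFrame({
    "Cus_ID": ["T348", "T348", "T348", "T329", "T329"],
    "Value": [2, 4, 3, 7.25, 10.25],
    "Date": pd.to_datetime(
        ["2015-11-14", "2015-12-13", "2016-01-24", "2016-01-31", "2016-02-17"]
    ),
})
sample["Value_Result"] = tw_avg_merge(sample, "Cus_ID", "Value", "Date")
print(sample)
```

The result is aligned to the original index, so it can be assigned directly as a new column without a groupby().apply() call.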


Hi Josh, thank you so much! This is exactly what I needed! – Thor