2014-01-08

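How to apply an expanding window by group in pandas

In particular, I would like to compute, per group, the expanding mean of the differences between consecutive dates in a series. So if I have something like this: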

Period  Group  dates
1       A      2010-07-01
2       A      2010-07-13
3       A      2010-07-13
4       A      2010-07-21
1       B      2000-08-20
2       B      2000-08-15

I would get:

Period  Group  cumulative average of differences
1       A      0
2       A      12/2
3       A      12/3
4       A      20/4
1       B      0
2       B      -5/2
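
To make the expected numbers concrete, here is a minimal sketch in plain Python that reproduces the table for both groups. The helper name `expanding_mean_of_diffs` is hypothetical, and the sketch assumes the first difference in each group counts as 0, as the table implies.

```python
from datetime import date

def expanding_mean_of_diffs(dates):
    """Running mean of consecutive day differences; the first difference is taken as 0."""
    out = []
    total = 0
    for i, d in enumerate(dates):
        if i > 0:
            # Add the gap (in days) between this date and the previous one
            total += (d - dates[i - 1]).days
        # Divide the running total by the number of rows seen so far
        out.append(total / (i + 1))
    return out

group_a = [date(2010, 7, 1), date(2010, 7, 13), date(2010, 7, 13), date(2010, 7, 21)]
group_b = [date(2000, 8, 20), date(2000, 8, 15)]

print(expanding_mean_of_diffs(group_a))  # [0.0, 6.0, 4.0, 5.0]
print(expanding_mean_of_diffs(group_b))  # [0.0, -2.5]
```

Because the consecutive differences telescope, the running total at each row equals the difference between that row's date and the first date in the group.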

Shouldn't the value for period 2 of group B be -5/2? Are you ultimately looking for the average difference in days (as a float)? – Jeff

Answers


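I have an alternative solution that is slightly longer than the one posted previously, but I think it makes it easier to see what happens inside the transformation function for the date column, and the output format is also a bit cleaner: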

import numpy as np 
import pandas as pd 
from datetime import date 

# Build data 
prd = [1, 2, 3, 4, 1, 2] 
grp = ['A', 'A', 'A', 'A', 'B', 'B'] 
yr = [2010, 2010, 2010, 2010, 2000, 2000] 
mth = [7, 7, 7, 7, 8, 8] 
day = [1, 13, 13, 21, 20, 15] 
dt = [date(y, m, d) for y, m, d in zip(yr, mth, day)] 
# Create data frame 
df = pd.DataFrame({'Period': prd, 'Group': grp, 'Dates': dt}, 
        columns=['Period', 'Group', 'Dates']) 

# Transformation function for the date column 
def f(ser): 
    v = ser.values 
    # Get time difference in days 
    delta = [float((ii-v[0]).days) for ii in v] 
    # Get number of items to divide by 
    dv = np.arange(len(delta))+1 
    # Get cumulative average 
    cumavg = [nm/dm for nm, dm in zip(delta, dv)] 
    # Create output pandas Series object and return it 
    out = pd.Series(cumavg, index=ser.index) 
    return out 

# Apply the transformation function to the Dates column
# (group_keys=False keeps the original row index so the merge below lines up)
dfappend = pd.DataFrame(
    {'Cum_Avg': df.groupby("Group", group_keys=False).Dates.apply(f)})
# Delete the Dates column 
del df['Dates'] 
# Merge to create the revised data frame 
df = pd.merge(df, dfappend, left_index=True, right_index=True) 
print(df) 

The output is:

Period Group Cum_Avg 
0  1  A  0.0 
1  2  A  6.0 
2  3  A  4.0 
3  4  A  5.0 
4  1  B  0.0 
5  2  B  -2.5 
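
The same result can also be sketched without a custom function, using only built-in groupby operations. This is an alternative under the same assumptions as above (each row's difference is measured from the first date of its group, then divided by the row's position within the group), not the answer author's method; the intermediate names `days_since_first` and `position` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Period": [1, 2, 3, 4, 1, 2],
    "Group": ["A", "A", "A", "A", "B", "B"],
    "Dates": pd.to_datetime([
        "2010-07-01", "2010-07-13", "2010-07-13", "2010-07-21",
        "2000-08-20", "2000-08-15",
    ]),
})

g = df.groupby("Group")["Dates"]
# Days elapsed since the first date of each group
days_since_first = (df["Dates"] - g.transform("first")).dt.days
# 1-based position of each row within its group
position = g.cumcount() + 1
df["Cum_Avg"] = days_since_first / position

print(df[["Period", "Group", "Cum_Avg"]])
```

Since `transform` and `cumcount` both return results aligned to the original index, no merge step is needed.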
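import io

import numpy as np
import pandas as pd

data = """Period Group dates
1   A  2010-07-01
2   A  2010-07-13
3   A  2010-07-13
4   A  2010-07-21
1   B  2000-08-20
2   B  2000-08-15"""

# Parse the whitespace-delimited text; column 2 holds the dates
df = pd.read_csv(io.StringIO(data), sep=r"\s+", parse_dates=[2])

def f(s):
    # Consecutive differences as integer nanoseconds (0 for the first row of a group)
    t = s.diff().fillna(pd.Timedelta(0)).astype(np.int64)
    # Expanding mean of the differences, converted back to timedelta
    return t.expanding().mean().astype(np.int64).astype("timedelta64[ns]")

# group_keys=False keeps the original row index in the result
r = df.groupby("Group", group_keys=False).dates.apply(f)
print(r)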

Output:

0   00:00:00 
1 6 days, 00:00:00 
2 4 days, 00:00:00 
3 5 days, 00:00:00 
4   00:00:00 
5 -2 days, 12:00:00 
dtype: timedelta64[ns] 

FYI, to convert the final values to floats, you can divide the result by np.timedelta64(1, 'D') – Jeff
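
A short sketch of the conversion suggested in the comment above. The series `r` here is rebuilt by hand from the values shown in the output, purely for illustration, and the name `as_days` is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the timedelta result: 0, 6, 4, 5, 0 and -2.5 days
r = pd.Series(pd.to_timedelta([0, 6, 4, 5, 0, -2.5], unit="D"))

# Dividing a timedelta Series by a one-day timedelta yields plain floats (days)
as_days = r / np.timedelta64(1, "D")
print(as_days.tolist())
```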