2017-06-26 76 views

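Efficiently applying a function to two pandas DataFrames

I have two DatetimeIndexed DataFrames with the same index and column names, each with about 8.26 million rows and 44 columns. The DataFrames are joined, and a groupby at 10-minute intervals is then applied, giving about 6884 groups. Matching column pairs are then iterated over, returning a single value for each group and column pair.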

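The solution below runs on a Xeon E5-2697 v3 and takes 34 minutes; all of the DataFrames fit in memory.

I think there should be a more efficient way to compute this with the two DataFrames, perhaps using Dask?

It is not clear to me, though, how to do a time-based groupby on a Dask DataFrame.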

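import math
import pandas as pd

def circular_mean(burst_veldirection, burst_velspeed):
    # Speed-weighted circular mean of the directions (degrees), returned as a non-negative angle.
    x = y = 0.
    for angle, weight in zip(burst_veldirection.values, burst_velspeed.values):
        x += math.cos(math.radians(angle)) * weight
        y += math.sin(math.radians(angle)) * weight

    mean = math.degrees(math.atan2(y, x))
    if mean < 0:
        mean = 360 + mean
    return mean

def circ_mean(df):
    # One circular mean per direction/speed column pair within a group.
    results = []
    for x in range(0, 45):
        results.append(circular_mean(df[str(x)], df[str(x) + 'velspeed']))
    return results

# Join on the shared DatetimeIndex; the speed columns get the 'velspeed' suffix.
burst_veldirection_velspeed = burst_veldirection.join(burst_velspeed, rsuffix='velspeed')

# pd.TimeGrouper was removed in pandas 1.0; newer versions use pd.Grouper(freq='10min').
result = burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)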
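As a small illustration of why the code above sums vector components instead of averaging the angles directly, here is a minimal sketch. The helper name `weighted_circular_mean` and the sample values are made up for this example and are not part of the original code.

```python
import numpy as np

def weighted_circular_mean(angles_deg, weights):
    """Weighted circular mean in degrees, wrapped to the 0-360 range (illustrative helper)."""
    theta = np.radians(np.asarray(angles_deg, dtype=float))
    w = np.asarray(weights, dtype=float)
    # Sum the weighted unit vectors, then take the direction of the resultant.
    x = np.sum(np.cos(theta) * w)
    y = np.sum(np.sin(theta) * w)
    return float(np.degrees(np.arctan2(y, x)) % 360.0)

# Two directions on either side of north, with equal (made-up) speeds.
directions = [350.0, 20.0]
speeds = [1.0, 1.0]

naive = float(np.mean(directions))                      # arithmetic mean of the raw angles
circular = weighted_circular_mean(directions, speeds)   # direction of the summed vectors
print(naive, circular)
```

The arithmetic mean lands near south, while the vector-based mean stays close to north, which is the behaviour `circular_mean` is after.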

Example short HDF file containing the first 10,000 records, covering 23 minutes.


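At 8 million rows, this project is probably small enough not to need Dask, as @EFT pointed out. If you do want to use Dask, you can convert one of the DataFrames to a Dask DataFrame and then use dask.dataframe.merge to merge it with the plain pandas DataFrame. – kuanb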

Answer


This doesn't get you away from the groupby, but just moving from the element-by-element loop to numpy functions gave me roughly an 8x speedup.

import numpy as np

def circ_mean2(df):
    # The joined frame holds the 45 direction columns first, then the 45 'velspeed' columns.
    df1 = df.iloc[:, :45]
    df2 = df.iloc[:, 45:]
    # Sum the speed-weighted vector components down each column.
    x = np.sum(np.cos(np.radians(df1.values)) * df2.values, axis=0)
    y = np.sum(np.sin(np.radians(df1.values)) * df2.values, axis=0)
    arctan = np.degrees(np.arctan2(y, x))
    # Shift only negative angles, matching the `mean < 0` test in circular_mean.
    return np.where(arctan >= 0, arctan, arctan + 360).tolist()

Comparison on 100 rows (random data):

burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
Out[546]:
    2017-01-01 00:00:00 [107.1417250368678, 256.8946560151866, 213.146...
    2017-01-01 00:10:00 [26.33395947005812, 27.786466256197127, 94.898...
    2017-01-01 00:20:00 [212.56183600787307, 284.77924347375733, 241.7...
    2017-01-01 00:30:00 [302.1659401891579, 91.1768853178421, 194.9664...
    2017-01-01 00:40:00 [90.29680187822757, 337.4345622590224, 302.219...
    2017-01-01 00:50:00 [94.88722975883893, 319.5580499260627, 204.511...
    2017-01-01 01:00:00 [133.4980653288851, 55.16669017531442, 20.7527...
    2017-01-01 01:10:00 [356.67045637546113, 151.25258425458003, 200.1...
    2017-01-01 01:20:00 [350.2489907863962, 33.284286840600046, 145.66...
    2017-01-01 01:30:00 [135.74199444105565, 62.66259615135012, 257.80...
    Freq: 10T, dtype: object

burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
Out[547]:
    2017-01-01 00:00:00 [107.1417236328125, 256.8946533203125, 213.146...
    2017-01-01 00:10:00 [26.333953857421875, 27.78646469116211, 94.898...
    2017-01-01 00:20:00 [212.5618438720703, 284.77923583984375, 241.72...
    2017-01-01 00:30:00 [302.16595458984375, 91.1768798828125, 194.966...
    2017-01-01 00:40:00 [90.29680633544922, 337.4345703125, 302.219909...
    2017-01-01 00:50:00 [94.88722229003906, 319.55804443359375, 204.51...
    2017-01-01 01:00:00 [133.498046875, 55.166690826416016, 20.7527561...
    2017-01-01 01:10:00 [356.6704406738281, 151.25257873535156, 200.13...
    2017-01-01 01:20:00 [350.2489929199219, 33.2842903137207, 145.6609...
    2017-01-01 01:30:00 [135.7419891357422, 62.66258239746094, 257.807...
    Freq: 10T, dtype: object

%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
10 loops, best of 3: 80.3 ms per loop

%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
10 loops, best of 3: 10.4 ms per loop

On 10,000 rows:

%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean)
1 loop, best of 3: 6.65 s per loop

%timeit burst_veldirection_velspeed.groupby(pd.TimeGrouper(freq='10Min')).apply(circ_mean2)
1 loop, best of 3: 709 ms per loop
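Since the answer still calls a Python function once per group, one possible further step is to compute the weighted vector components for the whole frame in one pass and let `resample(...).sum()` do the 10-minute binning. The sketch below is only an illustration under stated assumptions: the helper name `circular_mean_resample` and the synthetic demo data are invented here, the two frames are assumed to share the same DatetimeIndex and column names as described in the question, and it has not been timed against the real 8.26-million-row data.

```python
import numpy as np
import pandas as pd

def circular_mean_resample(direction, speed, freq="10min"):
    """Speed-weighted circular mean per time bin and column, in degrees (0-360).

    Hypothetical helper: `direction` and `speed` are assumed to be DataFrames
    with an identical DatetimeIndex and identical column names.
    """
    theta = np.radians(direction.to_numpy(dtype=float))
    w = speed.to_numpy(dtype=float)
    # Weighted vector components for every row and column, computed once.
    x = pd.DataFrame(np.cos(theta) * w, index=direction.index, columns=direction.columns)
    y = pd.DataFrame(np.sin(theta) * w, index=direction.index, columns=direction.columns)
    # Sum the components inside each time bin instead of calling apply per group.
    x_sum = x.resample(freq).sum()
    y_sum = y.resample(freq).sum()
    angles = np.degrees(np.arctan2(y_sum.to_numpy(), x_sum.to_numpy())) % 360.0
    return pd.DataFrame(angles, index=x_sum.index, columns=x_sum.columns)

# Synthetic demo data (made up): 100 one-minute rows, 3 column pairs.
rng = np.random.default_rng(0)
index = pd.date_range("2017-01-01", periods=100, freq="min")
cols = [str(i) for i in range(3)]
demo_direction = pd.DataFrame(rng.uniform(0, 360, (100, 3)), index=index, columns=cols)
demo_speed = pd.DataFrame(rng.uniform(0.1, 2.0, (100, 3)), index=index, columns=cols)

demo_result = circular_mean_resample(demo_direction, demo_speed)
print(demo_result.head())
```

The result is a DataFrame with one row per 10-minute bin and one column per direction/speed pair, rather than a Series of lists. One difference to be aware of: a bin with no rows sums to zero in both components, so it would come out as 0 degrees here rather than as a missing value.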