2017-05-09 138 views
1

My question (combine two pandas DataFrames and resample on one time column by averaging) is somewhat similar to this one, with a few key differences:

Combine two Pandas dataframes, resample on one time column, interpolate

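I have two datasets collected simultaneously by different acquisition systems at different sampling rates: one records data once per second (df2), the other records data every 11 minutes (df1). I want to create a single DataFrame containing both datasets, where the time index of the combined DataFrame comes from the DataFrame sampled every 11 minutes (df1). The combined DataFrame should hold the original data from df1, plus the data from the 1-second DataFrame (df2) averaged over the corresponding 11-minute period and appended to df1.

Here is some sample data: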

from datetime import datetime, timedelta 
import pandas as pd 
import numpy as np 

todays_date = datetime.now().date() 
index1 = pd.date_range(todays_date-timedelta(10), periods=10, freq='11min') 
index2 = pd.date_range(todays_date-timedelta(10), periods=6000, freq='S') 
columns1 = [15, 17, 19, 21, 24, 27, 30, 34, 38, 43, 48, 54, 60, 67, 75, 84, 
94, 105, 118, 132, 148, 166, 186, 208, 233, 261, 292, 327, 366, 410, 459, 
514, 576, 645, 722, 809, 906] 
columns2 = [103.73, 111.469, 119.786, 128.723, 138.327, 148.647, 159.737, 
171.655, 184.462, 198.224, 213.013, 228.905, 245.984, 264.336, 284.057, 
305.25, 328.024, 352.497, 378.797, 407.058, 437.427, 470.063, 505.133,  
542.82, 583.319, 626.839, 673.606, 723.862, 777.868, 835.903, 898.268, 
965.286, 1037.304, 1114.695, 1197.86, 1287.23, 1383.267, 1486.47, 1597.372, 
1716.548, 1844.616, 1982.239, 2130.13, 2289.054, 2459.835, 2643.358, 
2840.573, 3052.502, 3280.243, 3524.975, 3787.966, 4070.578, 4374.274, 
4700.629, 5051.333, 5428.202, 5833.189, 6268.39, 6736.061, 7238.624, 
7778.682, 8359.033, 8982.682, 9652.861, 10373.039, 11146.949, 11978.599, 
12872.296, 13832.67, 14864.696, 15973.718, 17165.483, 18446.161, 19822.39, 
21301.296, 22890.539, 24598.352, 26433.582] 
data1 = np.random.rand(10, 37)*1000 
data2 = np.random.rand(6000, 78)*1000 
df1 = pd.DataFrame(data1, index=index1, columns=columns1) 
df2 = pd.DataFrame(data2, index=index2, columns=columns2) 

What is the simplest way to do this?

Answers

3

I think you need concat + resample:

df3 = pd.concat([df1, df2.resample('11T').mean()], axis=1) 

Another approach is concat + groupby + Grouper:

df3 = pd.concat([df1, df2.groupby(pd.Grouper(freq='11T')).mean()], axis=1) 
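One caveat worth illustrating: both approaches only line up with df1 if the 11-minute bins start at df1's timestamps. By default pandas anchors the bins at midnight of the first day, which happens to match the sample data. The sketch below is not part of the original answer; it uses hypothetical frames named `slow` and `fast` with an off-midnight start, and assumes pandas 1.1 or later for the `origin` parameter. It also uses `'11min'` rather than `'11T'`, since the `T` alias is deprecated from pandas 2.2 onward.

```python
import numpy as np
import pandas as pd

# Hypothetical start time, deliberately not at midnight.
start = pd.Timestamp("2017-04-29 00:05:00")
idx_slow = pd.date_range(start, periods=2, freq="11min")
idx_fast = pd.date_range(start, periods=1320, freq="s")  # 22 minutes of 1 s data

slow = pd.DataFrame({"a": [1.0, 2.0]}, index=idx_slow)
# Values equal the elapsed seconds, so the bin means are easy to check by hand.
fast = pd.DataFrame({"b": np.arange(1320, dtype=float)}, index=idx_fast)

# Default origin ("start_day"): bins are anchored at midnight, so their labels
# do not coincide with slow's timestamps and concat adds extra NaN rows.
misaligned = pd.concat([slow, fast.resample("11min").mean()], axis=1)

# origin="start" anchors the bins at the first timestamp of `fast` instead.
by_resample = fast.resample("11min", origin="start").mean()
by_grouper = fast.groupby(pd.Grouper(freq="11min", origin="start")).mean()

combined = pd.concat([slow, by_resample], axis=1)
print(misaligned)
print(combined)
```

If the fast data does not start exactly at the first df1 timestamp, passing that timestamp itself as `origin` is another option.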

For testing, I created smaller DataFrames and changed the frequency of df2 to 1.1Min:

np.random.seed(123) 
todays_date = datetime.now().date() 
index1 = pd.date_range(todays_date-timedelta(10), periods=2, freq='11min') 
index2 = pd.date_range(todays_date-timedelta(10), periods=20, freq='1.1Min') 
columns1 = [15, 17] 
columns2 = [103.73, 111.469, 119.78] 
data1 = np.random.randint(10, size=(2, 2)) 
data2 = np.random.randint(3, size=(20, 3)) 
df1 = pd.DataFrame(data1, index=index1, columns=columns1) 
df2 = pd.DataFrame(data2, index=index2, columns=columns2) 
print (df1) 
        15 17 
2017-04-29 00:00:00 2 2 
2017-04-29 00:11:00 6 1 

print (df2) 
        103.730 111.469 119.780 
2017-04-29 00:00:00  2  1  2 
2017-04-29 00:01:06  1  0  1 
2017-04-29 00:02:12  2  1  0 
2017-04-29 00:03:18  2  0  1 
2017-04-29 00:04:24  2  1  0 
2017-04-29 00:05:30  0  0  0 
2017-04-29 00:06:36  1  2  0 
2017-04-29 00:07:42  2  0  0 
2017-04-29 00:08:48  1  0  1 
2017-04-29 00:09:54  0  0  0 
2017-04-29 00:11:00  2  1  1 
2017-04-29 00:12:06  2  2  2 
2017-04-29 00:13:12  1  0  0 
2017-04-29 00:14:18  2  1  0 
2017-04-29 00:15:24  2  2  2 
2017-04-29 00:16:30  2  1  2 
2017-04-29 00:17:36  0  1  0 
2017-04-29 00:18:42  2  0  2 
2017-04-29 00:19:48  1  2  0 
2017-04-29 00:20:54  2  2  0 

df3 = pd.concat([df1, df2.resample('11T').mean()], axis=1) 
print (df3) 
        15.000 17.000 103.730 111.469 119.780 
2017-04-29 00:00:00  2  2  1.3  0.5  0.5 
2017-04-29 00:11:00  6  1  1.6  1.2  0.9 

df3 = pd.concat([df1, df2.groupby(pd.Grouper(freq='11T')).mean()], axis=1) 
print (df3) 
        15.000 17.000 103.730 111.469 119.780 
2017-04-29 00:00:00  2  2  1.3  0.5  0.5 
2017-04-29 00:11:00  6  1  1.6  1.2  0.9 
+0

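Thanks for your solution. Out of interest, could the same approach be used when the time steps in df1 are irregular (by reading in the timestamps first?) – user1912925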

+0

I think so, you would only get `NaN` in rows where there is no data. – jezrael

+0

OK, that sounds reasonable. – user1912925
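For the irregular case raised in the comments, a fixed-frequency resample produces bin labels that generally do not coincide with df1's timestamps, so the concat would leave `NaN` on most rows. One possible alternative, sketched here and not taken from the answer, is to use df1's own timestamps as bin edges: each 1-second row is assigned to the most recent df1 timestamp at or before it, and the rows are averaged per group. The frames `slow` and `fast` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical slow dataset with irregular timestamps.
slow = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(
        ["2017-04-29 00:00:00", "2017-04-29 00:07:00", "2017-04-29 00:20:00"]
    ),
)
# 30 minutes of 1-second data; each value equals the elapsed seconds.
fast = pd.DataFrame(
    {"b": np.arange(1800, dtype=float)},
    index=pd.date_range("2017-04-29", periods=1800, freq="s"),
)

# Position of the latest slow timestamp that is <= each fast timestamp.
pos = slow.index.searchsorted(fast.index, side="right") - 1

# Rows before the first slow timestamp get pos == -1 and are dropped.
valid = pos >= 0

# Average the fast rows within each interval. The last interval is open-ended:
# it collects every fast row at or after the final slow timestamp.
means = fast[valid].groupby(slow.index[pos[valid]]).mean()

combined = slow.join(means)
print(combined)
```

Here each interval runs from one df1 timestamp up to, but not including, the next one. If the averaging window should instead end at each df1 timestamp, the `side` argument and the offset would need to be adjusted accordingly.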