2015-10-22

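Pandas: merge dataframes on the nearest time

I have two dataframes (logs and failures) that I want to merge, so that logs gains a column containing the closest date value found in failures.

The code to generate logs, failures, and the desired output is below: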

import pandas as pd
# The date strings are day-first (dd/mm/yyyy), so pass dayfirst=True when parsing them.
logs=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11']),'var1':pd.Series([0,1,3,1,2,4])})
logs['date-time']=pd.to_datetime(logs['date-time'], dayfirst=True)
failures=pd.DataFrame({'date':pd.Series(['23/10/2015 00:00:00','22/10/2015 00:00:00','21/10/2015 00:00:00']),'failure':pd.Series([1,1,1])})
failures['date']=pd.to_datetime(failures['date'], dayfirst=True)
# Desired result: logs plus the closest failure date for each row.
output=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11']),'var1':pd.Series([0,1,3,1,2,4]),'closest_failure':pd.Series(['23/10/2015 00:00:00','22/10/2015 00:00:00','21/10/2015 00:00:00','23/10/2015 00:00:00','23/10/2015 00:00:00','23/10/2015 00:00:00'])})
output['date-time']=pd.to_datetime(output['date-time'], dayfirst=True)
output['closest_failure']=pd.to_datetime(output['closest_failure'], dayfirst=True)

Any ideas? The real dataset is very large, so efficiency is also a concern.

Answers

3

You can use reindex with method="nearest". There may be a neater way, but building a Series that has the failure dates as both its index and its values works:

In [11]: failures_dt = pd.Series(failures["date"].values, failures["date"]) 

In [12]: failures_dt.reindex(logs["date-time"], method="nearest") 
Out[12]: 
date-time 
2015-10-23 10:20:54 2015-10-23 
2015-10-22 09:51:32 2015-10-22 
2015-10-21 06:51:32 2015-10-21 
2015-10-28 16:59:32 2015-10-23 
2015-10-25 04:41:32 2015-10-23 
2015-10-24 11:50:11 2015-10-23 
dtype: datetime64[ns] 

In [13]: logs["nearest"] = failures_dt.reindex(logs["date-time"], method="nearest").values 

In [14]: logs 
Out[14]: 
            date-time  var1    nearest
0 2015-10-23 10:20:54     0 2015-10-23
1 2015-10-22 09:51:32     1 2015-10-22
2 2015-10-21 06:51:32     3 2015-10-21
3 2015-10-28 16:59:32     1 2015-10-23
4 2015-10-25 04:41:32     2 2015-10-23
5 2015-10-24 11:50:11     4 2015-10-23
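
For reference, the reindex approach above can be collected into one self-contained script. This is only a sketch: the helper name add_nearest is made up for illustration, and sorting and de-duplicating the failure dates first is an added precaution (reindex with a method needs a unique, monotonic index), not something the answer above does.

```python
import pandas as pd

logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(["23/10/2015", "22/10/2015", "21/10/2015"], dayfirst=True),
    "failure": [1, 1, 1],
})


def add_nearest(logs, failures):
    """Hypothetical helper: return a copy of logs with a 'nearest' column."""
    # Assumption: duplicate failure dates carry no extra information here.
    dates = failures["date"].drop_duplicates().sort_values()
    # Series whose index and values are both the failure dates.
    lookup = pd.Series(dates.values, index=pd.DatetimeIndex(dates.values))
    out = logs.copy()
    # For each log timestamp, pick the value at the nearest index label.
    out["nearest"] = lookup.reindex(logs["date-time"], method="nearest").values
    return out


result = add_nearest(logs, failures)
print(result)
```

The original logs frame is left unchanged; only the returned copy carries the new column.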
1

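In pandas >= 0.19.0 you can now use pandas.merge_asof, which matches on the nearest key rather than an exact one. With 0.19 you can only get the most recent failure value at or before each log value. With 0.20, however, you can get the nearest in either direction.

The documentation for merge_asof says:

Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.

For each row in the left DataFrame, we select the last row in the right DataFrame whose 'on' key is less than or equal to the left's key. Both DataFrames must be sorted by the key.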
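
To make the matching rule concrete before applying it to dates, here is a small sketch with integer keys. The frames left and right and their column names are invented for illustration, and direction="nearest" assumes pandas >= 0.20.

```python
import pandas as pd

# Both frames must already be sorted by the key column.
left = pd.DataFrame({"key": [1, 5, 10], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

# Default (backward): last right row whose key is <= the left key.
backward = pd.merge_asof(left, right, on="key")

# Nearest: right row whose key is closest, whether before or after.
nearest = pd.merge_asof(left, right, on="key", direction="nearest")

print(backward)
print(nearest)
```

For the left key 5, the backward match picks the right key 3, while the nearest match picks 6 because it is only one step away.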

In [3]: failures.sort_values("date", inplace=True) 

In [6]: logs2 = pd.DataFrame({'date-time': pd.Series(['23/10/2015 10:20:54', '22/10/2015 09:51:32', '21/10/2015 06:51:32', '28/10/2015 16:59:32',
   ...:     '25/10/2015 04:41:32', '24/10/2015 11:50:11', '20/10/2015 01:02:03']), 'var1': pd.Series([0, 1, 3, 1, 2, 4, 99])})

In [7]: logs2['date-time'] = pd.to_datetime(logs2['date-time'], dayfirst=True)

In [8]: logs2.sort_values("date-time", inplace=True) 

In [9]: logs2 
Out[9]: 
            date-time  var1
6 2015-10-20 01:02:03    99
2 2015-10-21 06:51:32     3
1 2015-10-22 09:51:32     1
0 2015-10-23 10:20:54     0
5 2015-10-24 11:50:11     4
4 2015-10-25 04:41:32     2
3 2015-10-28 16:59:32     1

In [10]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date") 
Out[10]: 
            date-time  var1       date  failure
0 2015-10-20 01:02:03    99        NaT      NaN
1 2015-10-21 06:51:32     3 2015-10-21      1.0
2 2015-10-22 09:51:32     1 2015-10-22      1.0
3 2015-10-23 10:20:54     0 2015-10-23      1.0
4 2015-10-24 11:50:11     4 2015-10-23      1.0
5 2015-10-25 04:41:32     2 2015-10-23      1.0
6 2015-10-28 16:59:32     1 2015-10-23      1.0

In [11]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date", direction="nearest") 
Out[11]: 
            date-time  var1       date  failure
0 2015-10-20 01:02:03    99 2015-10-21        1
1 2015-10-21 06:51:32     3 2015-10-21        1
2 2015-10-22 09:51:32     1 2015-10-22        1
3 2015-10-23 10:20:54     0 2015-10-23        1
4 2015-10-24 11:50:11     4 2015-10-23        1
5 2015-10-25 04:41:32     2 2015-10-23        1
6 2015-10-28 16:59:32     1 2015-10-23        1
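
Because merge_asof needs sorted input, its result comes back in sorted order rather than in the original order of logs. The sketch below shows one possible way to get the closest_failure column from the question while restoring the original row order. The helper name add_closest_failure and the temporary _row column are assumptions made for illustration, and direction="nearest" again assumes pandas >= 0.20.

```python
import pandas as pd

logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(["23/10/2015", "22/10/2015", "21/10/2015"], dayfirst=True),
    "failure": [1, 1, 1],
})


def add_closest_failure(logs, failures):
    """Hypothetical helper: logs plus a 'closest_failure' column, original order kept."""
    # Remember each row's original position, then sort by the merge key.
    left = logs.assign(_row=range(len(logs))).sort_values("date-time")
    # Keep only the date column from failures, renamed to the target column name.
    right = (failures[["date"]]
             .rename(columns={"date": "closest_failure"})
             .sort_values("closest_failure"))
    merged = pd.merge_asof(left, right, left_on="date-time",
                           right_on="closest_failure", direction="nearest")
    # Undo the sort and restore the original index labels.
    merged = merged.sort_values("_row").drop(columns="_row")
    merged.index = logs.index
    return merged


result = add_closest_failure(logs, failures)
print(result)
```

For a large dataset, sorting both frames once and doing a single merge_asof is likely to scale better than any row-by-row search, though that is worth measuring on the real data.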