pandas.merge: match the nearest time stamp >= the series of timestamps

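I have two dataframes, both of which contain an irregularly spaced, millisecond-resolution timestamp column. My goal is to match up the rows so that for each matched row, 1) the first timestamp is always smaller than or equal to the second timestamp, and 2) the matched timestamps are the closest among all pairs of timestamps satisfying 1).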

Is there any way to do this with pandas.merge?

Source

2014-01-18 Tom Bennett

merge() can't do this kind of join, but you can use searchsorted():

Create some random timestamps, t1 and t2, both sorted in ascending order:

import pandas as pd 
import numpy as np 
np.random.seed(0) 

base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]") 

a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000 
t1 = base + a 
t1.sort() 

b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000 
t2 = base + b 
t2.sort()

Call searchsorted() to find, for every value in t2, its position in t1:

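# side="right" so that a t1 value equal to t2 counts as a match (t1 <= t2)
idx = np.searchsorted(t1, t2, side="right") - 1
# idx == -1 means no t1 value is <= that t2 value
mask = idx >= 0

df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})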

Here is the output:

                          t1                         t2
0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000 
1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000 
2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000 
3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000 
4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000 
5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000 
6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000 
7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000 
8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000
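
As a hand-checkable variant of the lookup above, here is a minimal sketch on three timestamps per side. The helper name match_previous and the sample values are made up for illustration and are not part of the original answer. It assumes t1 is already sorted, and it uses side="right" so that a t1 value exactly equal to a t2 value is accepted as a match.

```python
import numpy as np
import pandas as pd


def match_previous(t1, t2):
    """For each value in t2, pick the latest value in sorted t1 that is <= it.

    Hypothetical helper for illustration; t1 must be sorted ascending.
    """
    t1 = np.asarray(t1)
    t2 = np.asarray(t2)
    # side="right" returns the insertion point after any equal elements,
    # so subtracting 1 lands on the last t1 value that is <= t2.
    idx = np.searchsorted(t1, t2, side="right") - 1
    # idx == -1 means no t1 value is <= that t2 value; drop those rows.
    mask = idx >= 0
    return pd.DataFrame({"t1": t1[idx[mask]], "t2": t2[mask]})


# Made-up sample timestamps (millisecond resolution).
t1 = np.array(
    ["2013-01-01 00:00:00.100", "2013-01-01 00:00:00.250", "2013-01-01 00:00:00.900"],
    dtype="datetime64[ns]",
)
t2 = np.array(
    ["2013-01-01 00:00:00.050", "2013-01-01 00:00:00.250", "2013-01-01 00:00:00.800"],
    dtype="datetime64[ns]",
)

matched = match_previous(t1, t2)
print(matched)
```

The first t2 value has no earlier t1 and is dropped; the other two both pair with the .250 entry of t1, the second of them through an exact tie.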

To view this result graphically:

import pylab as pl 
pl.figure(figsize=(18, 4)) 
pl.vlines(pd.Series(t1), 0, 1, colors="g", lw=1) 
pl.vlines(df.t1, 0.3, 0.7, colors="r", lw=2) 
pl.vlines(df.t2, 0.3, 0.7, colors="b", lw=2) 
pl.margins(0.02)

Output:

[image: vertical-line plot of the t1 and t2 timestamps]

The green lines are t1, the blue lines are t2, and the red lines are the t1 values selected for each t2.

Source

2014-01-18 12:57:52 HYRY

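I did it a different way than HYRY:

Do a regular merge with an outer join (how='outer');
Sort it by date;
Use fillna(method='pad') to fill just the columns you need ('pad' fills each gap from the previous row);
Drop all the rows from the outer join that you don't need.

All of this can be written in a few lines: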

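# Outer join keeps every timestamp from both frames
df = pd.merge(df0, df1, on='Date', how='outer')
# sort_values() replaces the removed DataFrame.sort()
df = df.sort_values('Date')
headertofill = list(df1.columns.values)
# Forward-fill the df1 columns (same effect as fillna(method='pad'))
df[headertofill] = df[headertofill].ffill()
# Keep only the rows that came from df0
df = df[df[var_from_df0_only].notnull()]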

Source

2015-10-06 20:28:32 Yaron

You haven't defined what var_from_df0_only is. – Pigeon

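Pigeon, you most likely want to keep the original dataframe (df0) and enrich it with another one that has different timestamps (df1). With an outer join you get some extra rows from df1. To remove them, pick a column that exists only in df0 and not in df1 ("var_from_df0_only"); after the outer join that column is null on the extra rows, so you can filter on it. – Yaron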

Pandas sort() has been deprecated. We now have to use sort_values() instead. –

Here is a simpler and more general approach.

# data and signal are what we want to merge
keys = ['channel', 'timestamp'] # Could be simply ['timestamp']
index = data[keys].set_index(keys).index # Make index from columns to merge on
padded = signal.reindex(index, method='pad') # Key step -- reindex with filling
joined = data.join(padded, on=keys) # Join to data if needed
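
For reference, a minimal sketch of the reindex-with-padding idea on a single timestamp key. The sample data and the column names value and level are invented for illustration. It assumes signal is indexed by timestamp and that this index is sorted and unique, which reindex needs when a fill method is given.

```python
import pandas as pd

# Hypothetical sample data; signal must have a sorted, unique index.
data = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01 00:00:02", "2016-01-01 00:00:05"]),
    "value": [1.0, 2.0],
})
signal = pd.DataFrame(
    {"level": [10, 20, 30]},
    index=pd.to_datetime([
        "2016-01-01 00:00:01", "2016-01-01 00:00:04", "2016-01-01 00:00:06",
    ]),
)

# Unique timestamps to look up, taken from the column we merge on.
index = pd.Index(data["timestamp"]).unique()
# Key step: for each timestamp, take the last signal row at or before it.
padded = signal.reindex(index, method="pad")
# Attach the looked-up columns to data, matching on the timestamp column.
joined = data.join(padded, on="timestamp")
print(joined)
```

Timestamps earlier than the first signal row would come back as NaN, so drop or fill those afterwards if needed.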

Source

2016-10-19 22:19:40

pandas now has a function, merge_asof, that does exactly this.
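
A minimal sketch of how merge_asof could be applied to the question. The frames df1 and df2 and their values are invented for illustration. Both frames must be sorted by their key column, and direction="backward" (the default) picks, for each t2, the last t1 that is less than or equal to it.

```python
import pandas as pd

# Hypothetical sample frames; both must be sorted by their key columns.
df1 = pd.DataFrame({"t1": pd.to_datetime([
    "2013-01-01 00:00:00.100", "2013-01-01 00:00:00.250", "2013-01-01 00:00:00.900",
])})
df2 = pd.DataFrame({"t2": pd.to_datetime([
    "2013-01-01 00:00:00.050", "2013-01-01 00:00:00.250", "2013-01-01 00:00:00.800",
])})

# direction="backward" picks the last t1 that is <= each t2;
# exact matches are allowed by default.
merged = pd.merge_asof(df2, df1, left_on="t2", right_on="t1", direction="backward")
# Rows with no earlier t1 get NaT; drop them to keep only real matches.
merged = merged.dropna(subset=["t1"]).reset_index(drop=True)
print(merged)
```

merge_asof also accepts a tolerance argument if matches further apart than some time span should be rejected.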

Source

2018-03-06 05:11:56 cdarlint
