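Finding date range overlap in Python

I'm trying to find a more efficient way to detect overlapping date ranges (start/end dates given on each row) in a dataframe, grouped by a specific column (id).
The dataframe is sorted on the 'from' column.
I feel there should be a way to avoid the "double" apply that I'm using now...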
import pandas as pd
from datetime import datetime

df = pd.DataFrame(columns=['id', 'from', 'to'], index=range(5),
                  data=[[878, '2006-01-01', '2007-10-01'],
                        [878, '2007-10-02', '2008-12-01'],
                        [878, '2008-12-02', '2010-04-03'],
                        [879, '2010-04-04', '2199-05-11'],
                        [879, '2016-05-12', '2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])
    id       from         to
0  878 2006-01-01 2007-10-01
1  878 2007-10-02 2008-12-01
2  878 2008-12-02 2010-04-03
3  879 2010-04-04 2199-05-11
4  879 2016-05-12 2199-12-31
I call apply once over all the groups, and then, inside each group, I call apply again on every row:
def check_date_by_id(df):
    # Previous row's range within the same id group
    df['prevFrom'] = df['from'].shift()
    df['prevTo'] = df['to'].shift()

    def check_date_by_row(x):
        # The first row of a group has no previous range to compare against
        if pd.isnull(x.prevFrom) or pd.isnull(x.prevTo):
            x['overlap'] = False
            return x
        latest_start = max(x['from'], x.prevFrom)
        earliest_end = min(x['to'], x.prevTo)
        # Inclusive day count of the intersection; positive means overlap
        x['overlap'] = int((earliest_end - latest_start).days) + 1 > 0
        return x

    return df.apply(check_date_by_row, axis=1).drop(['prevFrom', 'prevTo'], axis=1)

df.groupby('id').apply(check_date_by_id)
    id       from         to  overlap
0  878 2006-01-01 2007-10-01    False
1  878 2007-10-02 2008-12-01    False
2  878 2008-12-02 2010-04-03    False
3  879 2010-04-04 2199-05-11    False
4  879 2016-05-12 2199-12-31     True
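For comparison, here is one possible way to drop both apply calls. This is only a minimal sketch, not a confirmed answer: it assumes the rows are sorted by 'from' within each id and that 'from' <= 'to' on every row. Under those assumptions the row-wise check above reduces to "does this row start on or before the previous row's end?", which can be computed with a grouped shift.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "id": [878, 878, 878, 879, 879],
    "from": pd.to_datetime(["2006-01-01", "2007-10-02", "2008-12-02",
                            "2010-04-04", "2016-05-12"]),
    "to": pd.to_datetime(["2007-10-01", "2008-12-01", "2010-04-03",
                          "2199-05-11", "2199-12-31"]),
})

# Assumption: rows are already sorted by 'from' within each id.
# End date of the previous row in the same id group (NaT for the first row).
prev_to = df.groupby("id")["to"].shift()

# Comparing against NaT yields False, so the first row of each group
# is reported as non-overlapping, as in the apply-based version.
df["overlap"] = df["from"] <= prev_to

print(df)
```

On the sample data this should flag only row 4, matching the output shown above. If the ranges are not guaranteed to satisfy 'from' <= 'to', the simplification no longer holds and the original max/min comparison would be needed.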
My code was inspired by the following link:
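Thanks, man. Simple and clear. Would you happen to know how to perform the same operation (groupby + check), but against all dates rather than only consecutive ones? – Edouard
I'm not entirely sure what you mean... if the dates are sorted, what else is there to check? I added an example grouped by 'id' for you. – miradulo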
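One reading of the "all dates" question is: flag a row if it overlaps any earlier row in its group, not just the one directly before it. That is an interpretation, not something the thread confirms. A rough sketch under that assumption keeps a running maximum of 'to' within each id. The data and the column name overlap_any_earlier are made up for illustration.

```python
import pandas as pd

# Hypothetical data: in id 1, the third row overlaps the first row
# but not the row immediately before it.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "from": pd.to_datetime(["2020-01-01", "2020-02-01",
                            "2020-03-01", "2020-01-15"]),
    "to": pd.to_datetime(["2020-12-31", "2020-02-10",
                          "2020-03-10", "2020-01-20"]),
})

df = df.sort_values(["id", "from"])

# Latest end date seen so far within each id, including the current row
running_max_to = df.groupby("id")["to"].cummax()

# Shift within each id so every row sees only the rows before it
prev_max_to = running_max_to.groupby(df["id"]).shift()

# A row overlaps some earlier row if it starts on or before
# the latest end date among those earlier rows.
df["overlap_any_earlier"] = df["from"] <= prev_max_to

print(df)
```

Here the third row is flagged because it starts before the first row ends, even though it does not overlap the second row. Note that this marks only the later row of each overlapping pair; marking both sides would need an additional pass.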