2015-11-24 17 views

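Pandas: merge only where a condition is true

I have two DataFrames that I need to merge, and I need to add a column that indicates whether each match was accepted.

I have this: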

dfa[dfa.CONTROL.isin([334030860978638])] 

Out[107]: 
      CONTROL    A    B    DATE_HOUR 
1629136  334030860978638  525562414612 52447860015000 2015-08-02 16:32:00 
1629137  334030860978638  525562414612 52447860015000 2015-08-02 16:42:32 
1629138  334030860978638  525562414612 52447860015000 2015-08-02 18:33:12 
1629139  334030860978638  525562414612 52447860015000 2015-08-03 19:40:19 


dfb[dfb.control.isin([334030860978638])] 

Out[108]: 
      control    a    b    date_hour 
id    
299366338 334030860978638  525562414612 447860015000 2015-08-02 16:33:08 
299392621 334030860978638  525562414612 447860015000 2015-08-02 16:43:40 
299665465 334030860978638  525562414612 447860015000 2015-08-02 18:34:21 

view = dfa.merge(dfb, left_on=['CONTROL', 'A', 'B'], 
       right_on=['control', 'a', 'b'], how='outer') 

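I need to compare DATE_HOUR with date_hour and check whether the two records fall within a time window of, for example, 3600 seconds. I also need to detect when several records fall inside that window; in that case I take the nearest one and mark it. In a new column, accepted, I set True for a match and False otherwise.

My expected output: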

CONTROL   A    B    DATE_HOUR   control    a    b    date_hour   accepted 
334030860978638 525562414612 52447860015000 2015-08-02 16:32:00 334030860978638  525562414612 52447860015000 2015-08-02 16:32:08 True 
334030860978638 525562414612 52447860015000 2015-08-02 16:42:32 334030860978638  525562414612 52447860015000 2015-08-02 16:43:40 True 
334030860978638 525562414612 52447860015000 2015-08-02 18:33:12 334030860978638  525562414612 52447860015000 2015-08-02 18:34:21 True 
334030860978638 525562414612 52447860015000 2015-08-03 19:40:19 NaN     NaN    NaN    NaT     False 

Can I use the apply method for this task? Could someone help me do this the right way in pandas?

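This is a very interesting problem. The image suggests that your 'dfa' already has the columns from 'dfb', and only the values are missing. That makes it a missing-data problem: you basically want to find the nearest neighbor for each row of dfa. First group on 'CONTROL' and 'control', then sort on 'DATE_HOUR' and 'date_hour'. After that, you will have to look up a nearest-neighbor algorithm and adapt it to your case. – Kartik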
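The sort-then-find-nearest idea in the comment above can be sketched without any extra library. This is only an illustration under stated assumptions: `nearest_timestamp` is a hypothetical helper name, it handles a single group (one CONTROL value), and the timestamps are sample values copied from the question.

```python
import numpy as np
import pandas as pd

def nearest_timestamp(left, right):
    """For each timestamp in `left`, return the closest timestamp in `right`.

    Hypothetical helper for illustration; `right` must not be empty.
    """
    right_sorted = np.sort(pd.DatetimeIndex(right).values)
    left_values = pd.DatetimeIndex(left).values
    # Position where each left value would be inserted to keep the order.
    pos = np.searchsorted(right_sorted, left_values)
    last = len(right_sorted) - 1
    # Candidate neighbours: the element just before and the one at that position.
    before = right_sorted[np.clip(pos - 1, 0, last)]
    after = right_sorted[np.clip(pos, 0, last)]
    # Keep whichever candidate is closer in absolute time.
    use_before = np.abs(left_values - before) <= np.abs(after - left_values)
    return pd.DatetimeIndex(np.where(use_before, before, after))

left = pd.to_datetime(['2015-08-02 16:32:00', '2015-08-02 16:42:32',
                       '2015-08-02 18:33:12', '2015-08-02 19:40:19'])
right = pd.to_datetime(['2015-08-02 16:33:08', '2015-08-02 16:43:40',
                        '2015-08-02 18:34:21'])

matched = nearest_timestamp(left, right)
gap_seconds = np.abs((matched - left).total_seconds())
# A match counts as accepted only if it lies inside the 3600-second window.
accepted = gap_seconds <= 3600
```

For several CONTROL values the same helper would have to be applied per group, for example inside a groupby.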

I will take your advice, thanks. This really had me confused. – paridin

Answer

A similar problem helped me solve mine.

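import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def nearest(group, match, groupname, lname, rname, name_field_diff='diff_minutes'):
    # Keep only the right-hand rows that share this group's key.
    match = match[match[groupname] == group.name]
    if match.empty:
        return group  # nothing to match against for this key
    group = group.copy()
    # Fit on the timestamps as int64 nanoseconds, one feature per row.
    right = match[rname].values.astype('int64')[:, None]
    left = group[lname].values.astype('int64')[:, None]
    nbrs = NearestNeighbors(n_neighbors=1).fit(right)
    dist, ind = nbrs.kneighbors(left)
    # Attach the nearest right-hand timestamp and the absolute gap in minutes.
    group[rname] = match[rname].values[ind.ravel()]
    time_diff = (group[rname] - group[lname]) / np.timedelta64(1, 'm')
    group[name_field_diff] = time_diff.abs()
    return group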

d1 = [{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:32:00'}, 
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:42:32'}, 
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 18:33:12'}, 
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 19:40:19'}] 

d2 = [{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:33:08'}, 
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:43:40'}, 
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 18:34:21'}] 

df1 = pd.DataFrame(d1) 
df1.DATE_HOUR = pd.to_datetime(df1.DATE_HOUR, format='%Y-%m-%d %H:%M:%S') 

df2 = pd.DataFrame(d2) 
df2.date_hour = pd.to_datetime(df2.date_hour, format='%Y-%m-%d %H:%M:%S') 

view = df1.groupby('CONTROL', group_keys=False).apply(nearest, df2, 'control', 'DATE_HOUR', 'date_hour')
view

    A    B    CONTROL    DATE_HOUR    date_hour    diff_minutes 
0 525562414612 52447860015000 334030860978638  2015-08-02 16:32:00  2015-08-02 16:33:08  1.133333 
1 525562414612 52447860015000 334030860978638  2015-08-02 16:42:32  2015-08-02 16:43:40  1.133333 
2 525562414612 52447860015000 334030860978638  2015-08-02 18:33:12  2015-08-02 18:34:21  1.150000 
3 525562414612 52447860015000 334030860978638  2015-08-02 19:40:19  2015-08-02 18:34:21  65.966667 

Now I filter on the time gap to find the records that do not fit.

df1[df1.index.isin(view[(view.diff_minutes >= 60)].index)] 

    A    B    CONTROL    DATE_HOUR 
3 525562414612 52447860015000 334030860978638  2015-08-02 19:40:19
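As a possible alternative to the nearest-neighbor approach, pandas provides `pd.merge_asof`, which joins on the nearest key and accepts a `tolerance`. The sketch below is not the method used in this answer; it assumes the same sample data, a pandas version that supports `direction='nearest'`, and the 3600-second window from the question. The `accepted` column is derived here simply from whether a match was found.

```python
import pandas as pd

df1 = pd.DataFrame({
    'CONTROL': [334030860978638] * 4,
    'A': [525562414612] * 4,
    'B': [52447860015000] * 4,
    'DATE_HOUR': pd.to_datetime(['2015-08-02 16:32:00', '2015-08-02 16:42:32',
                                 '2015-08-02 18:33:12', '2015-08-02 19:40:19']),
})
df2 = pd.DataFrame({
    'control': [334030860978638] * 3,
    'a': [525562414612] * 3,
    'b': [52447860015000] * 3,
    'date_hour': pd.to_datetime(['2015-08-02 16:33:08', '2015-08-02 16:43:40',
                                 '2015-08-02 18:34:21']),
})

# merge_asof requires both frames to be sorted by their time key.
left = df1.sort_values('DATE_HOUR')
right = df2.sort_values('date_hour')

merged = pd.merge_asof(
    left, right,
    left_on='DATE_HOUR', right_on='date_hour',
    left_by=['CONTROL', 'A', 'B'], right_by=['control', 'a', 'b'],
    direction='nearest',                   # closest record before or after
    tolerance=pd.Timedelta(seconds=3600),  # ignore matches outside the window
)
# Rows without a match inside the window get NaT in date_hour.
merged['accepted'] = merged['date_hour'].notna()
```

Note that `merge_asof` lets the same df2 row be matched by more than one df1 row, so an extra de-duplication step would be needed if each df2 record may be used only once.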