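Pandas: removing duplicate rows when merging two CSVs of different dimensions

I have been looking for a solution to this problem, and none of the answers I found seem to work, so I decided to ask for help with this specific use case. I am merging two CSVs that have different dimensions but share two identical columns. I first loaded the CSVs into pandas DataFrames,
df_td and df_ld, which look like this: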
>>> df_td.head(3)
  trans_id  store_num  cust_id               bus_date      type
0  0000001        104   111111  10/5/2017 12:00:00 AM   Payment
1  0000002        104   111111  10/5/2017 12:00:00 AM   Payment
2  0000003        104   111111  10/5/2017 12:00:00 AM  Received
>>> df_ld.head(2)
   cust_id   nxt_date  store_num  amt_received  type_rec
0   111111  11/5/2017        104         10.00       NaN
1   111112  11/6/2017        104         10.00       NaN
After running this code:
df_td_ld = pd.merge(df_td, df_ld, how='inner', on=['cust_id','store_num']).fillna(0)
I get a merged DataFrame like this:
>>> df_td_ld.head(3)
  trans_id  store_num  cust_id               bus_date      type   nxt_date  amt_received  type_rec
0  0000001        104   111111  10/5/2017 12:00:00 AM   Payment  11/5/2017         10.00       NaN
1  0000002        104   111111  10/5/2017 12:00:00 AM   Payment  11/5/2017         10.00       NaN
2  0000003        104   111111  10/5/2017 12:00:00 AM  Received  11/5/2017         10.00       NaN
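As you can see, I get duplicates in the columns that come from df_ld, even though cust_id 111111 appears only once in that DataFrame. If I query the merged frame and sum that column, it reports 30.00 for that date instead of the correct 10.00 for that customer at that store. I have tried outer, left, and right merges, as well as the concat and join functions, but I either get the same output or something completely wrong.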
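To make the effect concrete, here is a minimal reproduction. It is only a sketch: the two frames are rebuilt by hand from the samples above (the bus_date and type_rec columns are left out for brevity), and the names keys, rows, and naive_total are made up for illustration.

```python
import pandas as pd

# Toy frames mirroring the samples in the question (hand-built, not read from the CSVs).
df_td = pd.DataFrame({
    "trans_id": ["0000001", "0000002", "0000003"],
    "store_num": [104, 104, 104],
    "cust_id": [111111, 111111, 111111],
    "type": ["Payment", "Payment", "Received"],
})
df_ld = pd.DataFrame({
    "cust_id": [111111, 111112],
    "nxt_date": ["11/5/2017", "11/6/2017"],
    "store_num": [104, 104],
    "amt_received": [10.00, 10.00],
})

keys = ["cust_id", "store_num"]
merged = pd.merge(df_td, df_ld, how="inner", on=keys)

# Three transactions match a single df_ld row, so that row is copied three times.
rows = len(merged)
naive_total = merged["amt_received"].sum()
print(rows, naive_total)
```

Because the merge is many-to-one, every matching df_td row carries its own copy of amt_received, which is why the sum comes out three times too high no matter which how= option is used.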
What I want is something like this:
  trans_id  store_num  cust_id               bus_date      type   nxt_date  amt_received  type_rec
0  0000001        104   111111  10/5/2017 12:00:00 AM   Payment  11/5/2017             0       NaN
1  0000002        104   111111  10/5/2017 12:00:00 AM   Payment  11/5/2017             0       NaN
2  0000003        104   111111  10/5/2017 12:00:00 AM  Received  11/5/2017         10.00       NaN
Is this doable with merge/join/concat? Thanks!
This might help: http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html
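Following up on that pointer, here is one possible sketch rather than a definitive answer. It assumes each (cust_id, store_num) pair appears at most once in df_ld and that the amount should stay on the last matching transaction, as in the desired output. The frames are rebuilt by hand from the samples in the question, and the names keys, repeats, result, and deduped_total are hypothetical.

```python
import pandas as pd

# Hand-built stand-ins for the two CSVs (assumed values, taken from the samples).
df_td = pd.DataFrame({
    "trans_id": ["0000001", "0000002", "0000003"],
    "store_num": [104, 104, 104],
    "cust_id": [111111, 111111, 111111],
    "type": ["Payment", "Payment", "Received"],
})
df_ld = pd.DataFrame({
    "cust_id": [111111, 111112],
    "nxt_date": ["11/5/2017", "11/6/2017"],
    "store_num": [104, 104],
    "amt_received": [10.00, 10.00],
})

keys = ["cust_id", "store_num"]
df_td_ld = pd.merge(df_td, df_ld, how="inner", on=keys)

# Option 1: keep every transaction row, but zero the amount on all rows
# except the last one for each key pair.
repeats = df_td_ld.duplicated(subset=keys, keep="last")
result = df_td_ld.copy()
result.loc[repeats, "amt_received"] = 0

# Option 2: leave the merged frame alone and drop repeated key pairs
# only when computing the total.
deduped_total = df_td_ld.drop_duplicates(subset=keys)["amt_received"].sum()

print(result[["trans_id", "amt_received"]])
print(deduped_total)
```

Option 1 matches the layout asked for in the question, while option 2 is closer to the drop_duplicates suggestion and avoids changing the stored values. If the totals are all that matter, summing df_ld directly before merging would sidestep the duplication entirely.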