2013-04-15
Merge confusion in pandas

I want to merge two pandas DataFrames without using the index:

In [127]: df1
Out[127]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d

In [128]: df2
Out[128]:
             date id       value2    group
23044  2012-04-01  a  -0.06701001  group c
23045  2012-04-01  b     -0.02128  group c
23046  2012-04-01  c            0  group c
23047  2012-04-01  d            0  group c
23048  2012-04-01  e            0  group c

In [129]: pd.merge(df1, df2, how='outer', on=['date', 'id', 'value2', 'group'])
Out[129]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d
5     NaN  2012-04-01  a -0.067010  group c
6     NaN  2012-04-01  b -0.021280  group c
7     NaN  2012-04-01  c  0.000000  group c
8     NaN  2012-04-01  d  0.000000  group c
9     NaN  2012-04-01  e  0.000000  group c

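This is almost the desired output, but I would like the NaN entries of value1 in group c to be filled with the value1 from group d, matched on date and id. What is the right way to achieve this?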

Answers

I think this is unavoidably a two-step process.

To "fill in" value1, you associate any and all rows that share the same (date, id), without regard for group or value2.

In [5]: df3 = df2.set_index(['date', 'id']).join(
    ....:  df1.set_index(['date', 'id'])['value1']).reset_index() 
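For anyone who wants to run this step in isolation, here is a minimal sketch. The two frames are rebuilt by hand from the values printed in the question; the constructor calls, the string-typed dates, and the `group_c` / `group_d` spelling (taken from the answer's output) are assumptions, since the post does not show how the original frames were created.

```python
import pandas as pd

# Illustrative reconstruction of the question's frames (an assumption:
# the post only shows their printed form, not how they were built).
df1 = pd.DataFrame({
    "value1": [-0.2284, -0.4875, 0.1139, 0.3191, -0.0077],
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.067469, -0.021274, -0.015978, 0.022634, 0.0],
    "group": ["group_d"] * 5,
})
df2 = pd.DataFrame(
    {
        "date": ["2012-04-01"] * 5,
        "id": list("abcde"),
        "value2": [-0.06701001, -0.02128, 0.0, 0.0, 0.0],
        "group": ["group_c"] * 5,
    },
    index=range(23044, 23049),
)

# Step 1: look up value1 by (date, id) only, ignoring group and value2.
# join() aligns on the shared (date, id) index and keeps every row of df2.
df3 = (
    df2.set_index(["date", "id"])
       .join(df1.set_index(["date", "id"])["value1"])
       .reset_index()
)
print(df3)
```

After this step every group_c row carries the value1 of the group_d row with the same date and id, and the NaN values from the plain outer merge no longer appear.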

To get the final result, you list the distinct rows over all attributes, no longer lumping groups and values together.

In [6]: pd.merge(df1, df3, how='outer',
    ....:          on=['date', 'id', 'value1', 'value2', 'group'])
Out[6]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group_d
1 -0.4875  2012-04-01  b -0.021274  group_d
2  0.1139  2012-04-01  c -0.015978  group_d
3  0.3191  2012-04-01  d  0.022634  group_d
4 -0.0077  2012-04-01  e  0.000000  group_d
5 -0.2284  2012-04-01  a -0.067010  group_c
6 -0.4875  2012-04-01  b -0.021280  group_c
7  0.1139  2012-04-01  c  0.000000  group_c
8  0.3191  2012-04-01  d  0.000000  group_c
9 -0.0077  2012-04-01  e  0.000000  group_c
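
The two steps can also be wrapped into one function. The sketch below is only an illustration: `fill_and_stack` and its parameters are hypothetical names, the sample frames are rebuilt from the printed values in the question, and step 1 uses `DataFrame.merge` with `on=` instead of the `set_index` / `join` spelling above, which should give the same lookup. Row order of an outer merge can differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

def fill_and_stack(df1, df2, keys=("date", "id"), fill_col="value1"):
    """Hypothetical helper: copy fill_col from df1 into df2 by keys, then combine."""
    keys = list(keys)
    # Step 1: attach fill_col to every df2 row with a matching (date, id).
    df3 = df2.merge(df1[keys + [fill_col]], on=keys, how="left")
    # Step 2: outer merge on all columns, so each distinct row appears once.
    return pd.merge(df1, df3, how="outer", on=list(df1.columns))

# Illustrative data copied from the question's printed frames (an assumption).
df1 = pd.DataFrame({
    "value1": [-0.2284, -0.4875, 0.1139, 0.3191, -0.0077],
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.067469, -0.021274, -0.015978, 0.022634, 0.0],
    "group": ["group_d"] * 5,
})
df2 = pd.DataFrame({
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.06701001, -0.02128, 0.0, 0.0, 0.0],
    "group": ["group_c"] * 5,
})

result = fill_and_stack(df1, df2)
print(result)
```

Note that if df1 held several rows for the same (date, id), step 1 would repeat the matching df2 row once per value1, which is the "any and all rows" behaviour the answer describes.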