2017-08-08 71 views
1

我有兩個數據集,這樣如何合併重疊列

import pandas as pd 
import numpy as np 
df1 = pd.DataFrame({'id': [1, 2,3,4,5], 'first': [np.nan,np.nan,1,0,np.nan], 'second': [1,np.nan,np.nan,np.nan,0]}) 
df2 = pd.DataFrame({'id': [1, 2,3,4,5, 6], 'first': [np.nan,1,np.nan,np.nan,0, 1], 'third': [1,0,np.nan,1,1, 0]}) 

我想

result = pd.merge(df1, df2, left_index=True, right_index=True,on='id', how= 'outer') 
result['first']= result[["first_x", "first_y"]].sum(axis=1) 
result.loc[(result['first_x'].isnull()) & (result['first_y'].isnull()), 'first'] = np.nan 
result.drop(['first_x','first_y'] , 1) 

    id second third first 
0 1 1.0  1.0 NaN 
1 2 NaN  0.0 1.0 
2 3 NaN  NaN 1.0 
3 4 NaN  1.0 0.0 
4 5 0.0  1.0 0.0 
5 6 NaN  0.0 1.0 

的問題是,真正的數據集包括大約200個變量和我的路很長。如何使它更容易?由於

回答

1

您應該能夠使用combine_first

>>> df1.set_index('id').combine_first(df2.set_index('id')) 
    first second third 
id      
1  NaN  1  1 
2  1  NaN  0 
3  1  NaN NaN 
4  0  NaN  1 
5  0  0  1 
6  1  NaN  0 
0

也許應該使用combine_first由亞歷山大提及。如果您想保留id作爲列,您只需使用:

merged = df1.merge(df2)