2016-07-27 148 views
2

Compare pandas DataFrames and add a column

I have two DataFrames, as shown below:

df1     df2
A       A   C
A1      A1  C1
A2      A2  C2
A3      A3  C3
A1      A4  C4
A2
A3
A4

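In df2, the value in column 'A' determines the value in column 'C'. I want to add a new column 'B' to df1 whose values are taken from column 'C' of df2, matched on 'A'.

The final df1 should look like this: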

df1 
A B 
A1 C1 
A2 C2 
A3 C3 
A1 C1 
A2 C2 
A3 C3 
A4 C4 

I can iterate over df2 and add the values to df1, but that is slow because the data is large.

for index, row in df2.iterrows(): 
      df1.loc[df1.A.isin([row['A']]), 'B']= row['C'] 

Can someone help me understand how to solve this without looping over df2?

Thanks

Answers

1

IIUC, you can merge and then rename the column:

df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 

In [103]: 
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']}) 
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C3','C4']}) 
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 
merged 

Out[103]: 
    A B 
0 A1 C1 
1 A2 C2 
2 A3 C3 
3 A1 C1 
4 A2 C2 
5 A3 C3 
6 A4 C4 
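
If 'A' were ever repeated in df2, a left merge would quietly return more rows than df1 has. A minimal sketch of guarding against that, assuming a pandas version whose `merge` accepts the `validate` argument (0.21 or later, as far as I know); the data is the sample from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'C': ['C1', 'C2', 'C3', 'C4']})

# validate='many_to_one' makes merge raise pandas.errors.MergeError
# if 'A' is not unique in df2, instead of duplicating rows of df1.
merged = (df1.merge(df2, on='A', how='left', validate='many_to_one')
             .rename(columns={'C': 'B'}))
print(merged)
```

Note that merge returns a new DataFrame, so assign the result back to df1 if you want to replace it.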

Thank you all for the suggestions. I am using this solution because it also merges the other columns of df2 into df1. Thanks @EdChum –


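There is also a semantic difference between 'merge' and 'map' when df2 has duplicate values in 'A': 'merge' returns one row per match, whereas 'map' with a Series whose index is not unique raises an error. For a lookup in df1 that does not exist in df2, a left 'merge' inserts 'NaN', and 'map' with a Series or a plain dict also gives 'NaN' rather than raising 'KeyError' – EdChum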
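
How each approach treats a key in df1 that is missing from df2 is easy to check directly. A minimal sketch with made-up data (the key 'A9' is hypothetical, chosen only because it has no match); in the pandas versions I have tried, a left merge and both forms of map all leave NaN for the unmatched row:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A9']})   # 'A9' has no match in df2
df2 = pd.DataFrame({'A': ['A1', 'A2'], 'C': ['C1', 'C2']})

lookup = df2.set_index('A')['C']

via_merge = df1.merge(df2, on='A', how='left')['C']
via_map_series = df1['A'].map(lookup)
via_map_dict = df1['A'].map(lookup.to_dict())

# True marks the rows where no value was found
print(via_merge.isna().tolist())
print(via_map_series.isna().tolist())
print(via_map_dict.isna().tolist())
```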

1

You can use map with a Series:

df1['B'] = df1.A.map(df2.set_index('A')['C']) 
print (df1) 
    A B 
0 A1 C1 
1 A2 C2 
2 A3 C3 
3 A1 C1 
4 A2 C2 
5 A3 C3 
6 A4 C4 

The same works with map and a dict:

d = df2.set_index('A')['C'].to_dict() 
print (d) 
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'} 

df1['B'] = df1.A.map(d) 
print (df1) 
    A B 
0 A1 C1 
1 A2 C2 
2 A3 C3 
3 A1 C1 
4 A2 C2 
5 A3 C3 
6 A4 C4 

Timings

len(df1)=7

In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 
1000 loops, best of 3: 1.73 ms per loop 

In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C']) 
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 873 µs per loop 

len(df1)=70k

In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'}) 
100 loops, best of 3: 12.8 ms per loop 

In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C']) 
100 loops, best of 3: 6.05 ms per loop 
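
To rerun this comparison outside IPython, something like the following should do. It is only a sketch: the answer does not say how the 70k-row df1 was built, so repeating the 7-row sample 10,000 times is my assumption, and the helper names `with_merge` and `with_map` are made up for illustration. Absolute timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'C': ['C1', 'C2', 'C3', 'C4']})
base = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
# Assumption: build the 70k-row frame by repeating the sample rows
df1 = pd.concat([base] * 10000, ignore_index=True)

def with_merge():
    return df1.merge(df2, on='A', how='left').rename(columns={'C': 'B'})

def with_map():
    return df1['A'].map(df2.set_index('A')['C'])

# Sanity check that both approaches produce the same values
assert with_merge()['B'].tolist() == with_map().tolist()

for fn in (with_merge, with_map):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f'{fn.__name__}: {seconds * 1000:.2f} ms per call')
```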

Thanks @jezrael –


Hmm, maybe you can upvote all the solutions, thanks ;) – jezrael

1

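Based on the searchsorted method, here are three approaches that differ only in how the result is indexed. They assume df2.A is sorted and that every value in df1.A is present in df2.A: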

df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values 
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True) 
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
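

searchsorted only gives correct positions on sorted keys, and it returns an insertion point even for values that are absent. A minimal sketch of a more defensive variant, under my own assumptions: df2 may arrive unsorted, and df1 may contain a key (the hypothetical 'A5') that df2 lacks, in which case 'B' is left as NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'A': ['A3', 'A1', 'A4', 'A2'],      # deliberately unsorted
                    'C': ['C3', 'C1', 'C4', 'C2']})

lookup = df2.sort_values('A')             # searchsorted needs sorted keys
keys = lookup['A'].to_numpy()
wanted = df1['A'].to_numpy()

pos = keys.searchsorted(wanted)           # insertion points, may equal len(keys)
pos = np.minimum(pos, len(keys) - 1)      # keep positions inside the array
found = keys[pos] == wanted               # True only for exact matches

values = lookup['C'].to_numpy()[pos]
df1['B'] = pd.Series(values, index=df1.index).where(found)
print(df1)
```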