I think you need an inner join with merge (how='inner' is the default, and the join uses the columns both frames share):
df = pd.merge(X, Y)
Or:
X.set_index(['user_id', 'sku_id'], inplace=True)
df = Y.join(X, how='inner', on=['user_id', 'sku_id'])
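As a minimal sketch of the two forms above, here they are on a pair of toy frames. The values are invented for illustration and are not the asker's data. Both calls should return the same rows:

```python
import pandas as pd

# Toy frames; all values are made up for illustration.
X = pd.DataFrame({'user_id': [1, 1, 2, 3],
                  'sku_id': [10, 11, 10, 12],
                  'brand': [7, 8, 9, 7]})
Y = pd.DataFrame({'user_id': [1, 2, 4],
                  'sku_id': [11, 10, 12]})

# merge defaults to how='inner' and joins on every column name the
# two frames share, here user_id and sku_id.
merged = pd.merge(X, Y)

# Equivalent join: index X by the key columns, then join Y against it.
# set_index is not done in place here, so X itself stays unchanged.
joined = Y.join(X.set_index(['user_id', 'sku_id']),
                how='inner', on=['user_id', 'sku_id'])

print(merged)
print(joined)
```

Only the (user_id, sku_id) pairs present in both frames survive, and brand is carried over from X.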
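Another solution is isin with boolean indexing, but it only works if user_id is unique (Series.isin compares values and ignores the index, so the filter below effectively tests sku_id alone):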
X = X.set_index('user_id')
df = X[X['sku_id'].isin(Y.set_index('user_id')['sku_id'])].reset_index()
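If the membership test has to cover both columns at once, one option is to build a combined key from each frame and test that instead. The sketch below uses toy data (invented values) and `MultiIndex.isin`. It is an illustration of the idea, not a claim that it beats merge:

```python
import pandas as pd

# Toy frames; all values are made up for illustration.
X = pd.DataFrame({'user_id': [1, 1, 2, 3],
                  'sku_id': [10, 11, 10, 12],
                  'brand': [7, 8, 9, 7]})
Y = pd.DataFrame({'user_id': [1, 2, 4],
                  'sku_id': [11, 10, 12]})

# Testing sku_id alone keeps every row whose sku_id occurs anywhere in Y,
# whichever user it belongs to.
naive = X[X['sku_id'].isin(Y['sku_id'])]

# Testing (user_id, sku_id) pairs: one MultiIndex entry per row.
x_keys = pd.MultiIndex.from_arrays([X['user_id'], X['sku_id']])
y_keys = pd.MultiIndex.from_arrays([Y['user_id'], Y['sku_id']])
mask = x_keys.isin(y_keys)  # boolean array, one flag per row of X
filtered = X[mask]

print(naive)
print(filtered)
```

On this data the single-column test keeps the row (3, 12) because sku_id 12 appears in Y for a different user, while the pairwise test drops it.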
In general, merge is the best and fastest option in pandas, as these timings show:
In [143]: %%timeit
...: (Y1.join(X1.set_index(['user_id', 'sku_id']),how='inner',on=['user_id','sku_id']))
...:
1 loop, best of 3: 583 ms per loop
In [144]: %%timeit
...: (pd.merge(X2,Y2))
...:
1 loop, best of 3: 487 ms per loop
In [145]: %%timeit
...: x = pd.MultiIndex.from_arrays([X['user_id'], X['sku_id']])
...: y = pd.MultiIndex.from_arrays([Y['user_id'], Y['sku_id']])
...: inter = x.intersection(y)
...: a = X.set_index(['user_id', 'sku_id']).loc[inter].reset_index()
...:
1 loop, best of 3: 1.26 s per loop
#another solution
In [146]: %%timeit
...: X[(X['user_id'].astype(str) +"_" +X['sku_id'].astype(str)).isin((Y['user_id'].astype(str)+"_"+Y['sku_id'].astype(str)))]
...:
1 loop, best of 3: 6.34 s per loop
If all values are strings (X = X.astype(str), Y = Y.astype(str)):
In [147]: %%timeit
...: X[(X['user_id'] +"_" +X['sku_id']).isin((Y['user_id']+"_"+Y['sku_id']))]
...:
1 loop, best of 3: 953 ms per loop
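To make the string-key variant concrete, here is a small sketch on invented data. It assumes the separator `_` never occurs inside the key values; otherwise two different pairs could produce the same combined key.

```python
import pandas as pd

# Invented data, converted to strings up front as in the timing above.
X = pd.DataFrame({'user_id': [1, 1, 2, 3],
                  'sku_id': [10, 11, 10, 12],
                  'brand': [7, 8, 9, 7]}).astype(str)
Y = pd.DataFrame({'user_id': [1, 2, 4],
                  'sku_id': [11, 10, 12]}).astype(str)

# Combine both key columns into a single string per row.
x_key = X['user_id'] + '_' + X['sku_id']
y_key = Y['user_id'] + '_' + Y['sku_id']

# Keep the rows of X whose combined key also occurs in Y.
by_key = X[x_key.isin(y_key)]

# The inner merge should select the same rows.
by_merge = pd.merge(X, Y)

print(by_key)
print(by_merge)
```

Unlike merge, the boolean mask keeps X's original index, so add `reset_index(drop=True)` if a fresh index is wanted.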
Setup code for the timings:
import numpy as np
import pandas as pd
np.random.seed(123)
N = 1000000
X = pd.DataFrame({'user_id':np.random.randint(10000, size=N),
'sku_id': np.random.randint(10000, size=N),
'brand': np.random.randint(10000, size=N)})
X = X.drop_duplicates(subset=['user_id', 'sku_id'])
print (X)
X1,X2 = X.copy(), X.copy()
Y = pd.DataFrame({'user_id':np.random.randint(10000, size=N),
'sku_id': np.random.randint(10000, size=N)})
print (Y)
Y = Y.drop_duplicates(subset=['user_id', 'sku_id'])
Y1,Y2 = Y.copy(), Y.copy()
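The `%%timeit` cells above need IPython. Outside IPython, a rough equivalent with the standard `timeit` module might look like the sketch below. N is reduced here (an assumption, to keep the run short), so the absolute numbers will not match the ones above, and the helper names `via_merge` and `via_join` are made up for this example.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # assumption: smaller than the 1,000,000 rows used above

X = pd.DataFrame({'user_id': np.random.randint(10000, size=N),
                  'sku_id': np.random.randint(10000, size=N),
                  'brand': np.random.randint(10000, size=N)})
X = X.drop_duplicates(subset=['user_id', 'sku_id'])
Y = pd.DataFrame({'user_id': np.random.randint(10000, size=N),
                  'sku_id': np.random.randint(10000, size=N)})
Y = Y.drop_duplicates(subset=['user_id', 'sku_id'])


def via_merge():
    # Inner join on the shared columns user_id and sku_id.
    return pd.merge(X, Y)


def via_join():
    # Same join, expressed through an index on X.
    return Y.join(X.set_index(['user_id', 'sku_id']),
                  how='inner', on=['user_id', 'sku_id'])


for name, fn in [('merge', via_merge), ('join', via_join)]:
    # Run each candidate 3 times and report the fastest run.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{name}: {seconds:.3f} s (best of 3)')
```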
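I want to find a more efficient way to get this. merge() takes a lot of time. – Husy
OK, I'll give it a try. Thanks! – Husy
@Husy - Are you sure the other solution is faster? If the columns have to be converted to strings first, it will be slower. Check my timings. Or is there no need to convert the columns to strings? – jezrael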