2017-05-04 18 views

Answers

3

A faster option is to use map:

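import numpy as np
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 2, 3, 1, 2, 3],
                    'price': [4, 5, 6, 7, 8, 9]})

print(df1)

df2 = pd.DataFrame({'unique_id': [1, 2, 3],
                    'price': [46, 55, 44]})

print(df2)

# Index df2 by unique_id, then look up each df1 id in the resulting Series.
df1['price2'] = df1['unique_id'].map(df2.set_index('unique_id')['price'])
print(df1)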
   price  unique_id  price2
0      4          1      46
1      5          2      55
2      6          3      44
3      7          1      46
4      8          2      55
5      9          3      44
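
In the example above every id in df1 has a match in df2. As a minimal sketch of what map does when an id has no match, assuming hypothetical frames named left and lookup (not from the answer) where id 4 is missing from the lookup side:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: unique_id 4 has no match in the lookup frame.
left = pd.DataFrame({'unique_id': [1, 2, 4], 'price': [4, 5, 6]})
lookup = pd.DataFrame({'unique_id': [1, 2, 3], 'price': [46, 55, 44]})

# Build a Series indexed by unique_id; map() looks each id up in it.
prices = lookup.set_index('unique_id')['price']
left['price2'] = left['unique_id'].map(prices)

# Unmatched ids come back as NaN, which makes the new column float.
print(left)

# Optionally fill the gaps with a sentinel (-1 is an arbitrary choice here)
# and restore an integer dtype.
left['price2_filled'] = left['price2'].fillna(-1).astype(int)
print(left)
```

Whether a sentinel or NaN is the better choice depends on how the column is used afterwards.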

np.random.seed(123)
N = 1000000
L = np.random.randint(1000, size=N)
df1 = pd.DataFrame({'unique_id': np.random.choice(L, N),
                    'price': np.random.choice(L, N)})
print(df1)

df2 = pd.DataFrame({'unique_id': np.arange(N),
                    'price': np.random.choice(L, N)})

print(df2)

In [60]: %timeit df1['price2'] = df1['unique_id'].map(df2.set_index('unique_id')['price']) 
1 loop, best of 3: 168 ms per loop 

In [61]: %timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left') 
1 loop, best of 3: 373 ms per loop 

In [62]: %timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2') 
1 loop, best of 3: 252 ms per loop 
3

Consider the dataframes df1 and df2:

df1 = pd.DataFrame({
    'unique_id': [1, 2, 3],
    'price': [11, 12, 13],
})

df2 = pd.DataFrame({
    'unique_id': [1, 2, 3, 4, 5],
    'price': [9, 10, 11, 12, 13],
})

merge

df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left') 

   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11

join

df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2') 

   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11
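
Both calls give one row per df1 row only if each unique_id appears at most once in df2. A minimal sketch of how that assumption could be checked, using merge's validate and indicator parameters; the names merged and joined are illustrative, not from the answer:

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 2, 3], 'price': [11, 12, 13]})
df2 = pd.DataFrame({'unique_id': [1, 2, 3, 4, 5],
                    'price': [9, 10, 11, 12, 13]})

# validate='many_to_one' raises if unique_id is duplicated in df2;
# indicator=True adds a _merge column saying where each row was found.
merged = df1.merge(df2, on='unique_id', how='left', suffixes=('', '2'),
                   validate='many_to_one', indicator=True)
print(merged)

# The join form needs df2 indexed by the key column.
joined = df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')
print(joined)
```

Rows whose _merge value is left_only would be the ids with no price in df2.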

Experimental: FAST
Using numpy.searchsorted

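def pir1(d1, d2):
    # Assumes every unique_id in d1 also appears in d2; a missing id would
    # silently pick up a neighbouring price or raise an IndexError.
    u1 = d1.unique_id.values
    u2 = d2.unique_id.values
    p2 = d2.price.values
    a = u2.argsort()  # order that sorts d2's ids
    # Binary-search each d1 id in the sorted d2 ids, then take that price.
    return d1.assign(price2=p2[a][u2[a].searchsorted(u1)])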

pir1(df1, df2) 

   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11
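
pir1 relies on every id in d1 being present in d2. A minimal sketch of a more defensive variant, assuming a non-empty d2 with unique ids; the helper name lookup_sorted, its fill parameter, and the sample frames a and b are hypothetical:

```python
import numpy as np
import pandas as pd

def lookup_sorted(d1, d2, fill=np.nan):
    """Hypothetical helper: like pir1, but tolerates ids missing from d2."""
    u1 = d1['unique_id'].to_numpy()
    u2 = d2['unique_id'].to_numpy()
    p2 = d2['price'].to_numpy()
    order = u2.argsort()
    sorted_ids = u2[order]
    pos = sorted_ids.searchsorted(u1)
    # Clip so ids larger than every key do not index out of bounds.
    pos = np.minimum(pos, sorted_ids.size - 1)
    # An id was really found only if the key at that position equals it.
    found = sorted_ids[pos] == u1
    return d1.assign(price2=np.where(found, p2[order][pos], fill))

# Illustrative data: ids 7 and 0 are absent from b.
a = pd.DataFrame({'unique_id': [1, 2, 7, 0], 'price': [11, 12, 13, 14]})
b = pd.DataFrame({'unique_id': [3, 1, 2], 'price': [30, 10, 20]})
result = lookup_sorted(a, b)
print(result)
```

The extra comparison costs one more pass over the data, so it is likely somewhat slower than pir1.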

Timing
pir1 is the fastest method.
Small data

%timeit pir1(df1, df2) 
1000 loops, best of 3: 279 µs per loop 

%timeit df1.assign(price2=df1['unique_id'].map(df2.set_index('unique_id')['price'])) 
1000 loops, best of 3: 892 µs per loop 

%timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left') 
1000 loops, best of 3: 1.18 ms per loop 

%timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2') 
1000 loops, best of 3: 1.02 ms per loop 

Large data
Using @jezrael's test data

np.random.seed(123)
N = 1000000
L = np.random.randint(1000, size=N)
df1 = pd.DataFrame({'unique_id': np.random.choice(L, N),
                    'price': np.random.choice(L, N)})

df2 = pd.DataFrame({'unique_id': np.arange(N),
                    'price': np.random.choice(L, N)})


%timeit pir1(df1, df2) 
10 loops, best of 3: 104 ms per loop 

%timeit df1.assign(price2=df1['unique_id'].map(df2.set_index('unique_id')['price'])) 
10 loops, best of 3: 138 ms per loop 

%timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left') 
1 loop, best of 3: 243 ms per loop 

%timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2') 
10 loops, best of 3: 168 ms per loop 
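
The %timeit lines above need IPython. A minimal sketch of a comparable check in plain Python with the standard timeit module, assuming a much smaller hypothetical size n and wrapper functions (via_map, via_merge, via_join) that are not part of the answer; it also confirms the three approaches agree before timing them:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(123)
n = 10_000  # hypothetical size, far smaller than the 1,000,000 rows above
df1 = pd.DataFrame({'unique_id': rng.randint(0, n, size=n),
                    'price': rng.randint(0, 1000, size=n)})
df2 = pd.DataFrame({'unique_id': np.arange(n),
                    'price': rng.randint(0, 1000, size=n)})

def via_map():
    prices = df2.set_index('unique_id')['price']
    return df1.assign(price2=df1['unique_id'].map(prices))

def via_merge():
    return df1.merge(df2, on='unique_id', suffixes=('', '2'), how='left')

def via_join():
    return df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')

# The timings only mean something if all three return the same column.
expected = via_map()['price2'].to_numpy()
assert np.array_equal(expected, via_merge()['price2'].to_numpy())
assert np.array_equal(expected, via_join()['price2'].to_numpy())

for fn in (via_map, via_merge, via_join):
    # Best of 3 runs, 5 calls per run, reported per call.
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')
```

Absolute numbers will differ by machine and pandas version, so the ranking reported above may not reproduce exactly.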
3

Another solution:

df1['price_df2'] = df1['unique_id'].map(df2.set_index('unique_id')['price']) 

Borrowing @piRSquared's sample DataFrames again ;-)

In [42]: df1 
Out[42]: 
   price  unique_id
0     11          1
1     12          2
2     13          3

In [43]: df2 
Out[43]: 
   price  unique_id
0      9          1
1     10          2
2     11          3
3     12          4
4     13          5

In [44]: df1['price_df2'] = df1['unique_id'].map(df2.set_index('unique_id')['price']) 

In [45]: df1 
Out[45]: 
   price  unique_id  price_df2
0     11          1          9
1     12          2         10
2     13          3         11