2017-04-19 92 views

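Cosine distance computation between two arrays - Python

I want to apply a function fn, which is essentially a cosine distance calculation, row-wise to two large NumPy arrays of shapes (10000, 100) and (5000, 100); that is, I compute one value for every combination of rows from the two arrays.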

My implementation:

import math

def fn(v1, v2):
    # Accumulate the two squared norms and the dot product in one pass
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

# Evaluate fn for every (row of array1, row of array2) pair
val = []
for i in range(array1.shape[0]):
    for j in range(array2.shape[0]):
        val.append(fn(array1[i, :], array2[j, :]))

The function itself is very fast, taking only a few milliseconds:

CPU times: user 4 ms, sys: 0 ns, total: 4 ms 
Wall time: 1.24 ms 

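The nested loops, however, call it once for each of the 10000 × 5000 = 50 million row pairs. Is there a more efficient way to do this?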


'fn' computes the cosine similarity between two vectors. I have updated the question.

Answers


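Approach #1: We can simply use SciPy's cdist with its 'cosine' metric. Since cdist returns the cosine distance, subtracting it from 1 gives the similarity that fn computes: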

from scipy.spatial.distance import cdist 

val_out = 1 - cdist(array1, array2, 'cosine') 

Approach #2: Another approach, using matrix multiplication:

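import numpy as np

def cosine_vectorized(array1, array2):
    sumyy = (array2**2).sum(1)                    # squared norms of array2 rows, shape (n2,)
    sumxx = (array1**2).sum(1, keepdims=True)     # squared norms of array1 rows, shape (n1, 1)
    sumxy = array1.dot(array2.T)                  # all pairwise dot products, shape (n1, n2)
    return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy)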
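
As a quick sanity check (not part of the original answer), the sketch below compares the vectorized function with the looped fn from the question on two small random arrays. The names small1, small2, looped and vectorized, and the array sizes, are made up for illustration.

```python
import math
import numpy as np

def fn(v1, v2):
    # Reference implementation from the question: cosine similarity of two vectors.
    sumxx = sumxy = sumyy = 0.0
    for x, y in zip(v1, v2):
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

def cosine_vectorized(array1, array2):
    sumyy = (array2 ** 2).sum(1)
    sumxx = (array1 ** 2).sum(1, keepdims=True)
    sumxy = array1.dot(array2.T)
    return (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)

# Small illustrative inputs (hypothetical sizes, chosen only to keep the loop cheap)
rng = np.random.default_rng(0)
small1 = rng.random((6, 4))
small2 = rng.random((3, 4))

# Looped result arranged as a (6, 3) matrix: entry [i, j] pairs row i with row j
looped = np.array([[fn(r1, r2) for r2 in small2] for r1 in small1])
vectorized = cosine_vectorized(small1, small2)

print(vectorized.shape)
print(np.allclose(looped, vectorized))
```

The vectorized result is a 2-D array; calling .ravel() on it should give the values in the same order as the flat list val built by the nested loops in the question.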

Approach #3: Another one, using np.einsum to compute the squared-sum terms:

def cosine_vectorized_v2(array1, array2): 
    sumyy = np.einsum('ij,ij->i',array2,array2) 
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None] 
    sumxy = array1.dot(array2.T) 
    return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy) 
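
To illustrate what the einsum subscripts do here, a minimal sketch: 'ij,ij->i' multiplies the two inputs element-wise and sums over the column index j, so passing the same array twice yields the per-row sum of squares. The array a and the variable names are illustrative only.

```python
import numpy as np

# Small illustrative array (hypothetical values)
a = np.arange(12, dtype=float).reshape(3, 4)

row_sq_sum = (a ** 2).sum(1)                  # square, then sum each row
row_sq_einsum = np.einsum('ij,ij->i', a, a)   # same reduction in a single einsum call

print(np.allclose(row_sq_sum, row_sq_einsum))
print(row_sq_einsum)
```

Both expressions give the same numbers; the einsum form avoids materializing the intermediate a**2 array.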

Approach #4: Another approach, bringing in the numexpr module to offload the square-root and division computations:

import numexpr as ne 

def cosine_vectorized_v3(array1, array2): 
    sumyy = np.einsum('ij,ij->i',array2,array2) 
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None] 
    sumxy = array1.dot(array2.T) 
    sqrt_sumxx = ne.evaluate('sqrt(sumxx)') 
    sqrt_sumyy = ne.evaluate('sqrt(sumyy)') 
    return ne.evaluate('(sumxy/sqrt_sumxx)/sqrt_sumyy') 

Runtime test

# Using same sizes as stated in the question 
In [185]: array1 = np.random.rand(10000,100) 
    ...: array2 = np.random.rand(5000,100) 
    ...: 

In [194]: %timeit 1 - cdist(array1, array2, 'cosine') 
1 loops, best of 3: 366 ms per loop 

In [195]: %timeit cosine_vectorized(array1, array2) 
1 loops, best of 3: 287 ms per loop 

In [196]: %timeit cosine_vectorized_v2(array1, array2) 
1 loops, best of 3: 283 ms per loop 

In [197]: %timeit cosine_vectorized_v3(array1, array2) 
1 loops, best of 3: 217 ms per loop
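
For comparison, here is a variant that was not among the approaches timed above: scale each row to unit length first, so a single matrix product yields all similarities. This is only a sketch; the function name cosine_normalized is made up, it assumes neither array contains an all-zero row, and its speed relative to the timings above has not been measured here.

```python
import numpy as np

def cosine_normalized(array1, array2):
    # Divide every row by its Euclidean norm (assumes no all-zero rows),
    # then one matrix product gives every pairwise cosine similarity.
    unit1 = array1 / np.linalg.norm(array1, axis=1, keepdims=True)
    unit2 = array2 / np.linalg.norm(array2, axis=1, keepdims=True)
    return unit1 @ unit2.T

# Small illustrative inputs (hypothetical sizes)
rng = np.random.default_rng(1)
x = rng.random((5, 4))
y = rng.random((2, 4))

sims = cosine_normalized(x, y)
print(sims.shape)
```

With the sizes from the question, the result is a 10000 x 5000 float64 array, which takes about 400 MB regardless of which approach produces it.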