Approach #1: We could simply use SciPy's cdist with its cosine distance functionality -
from scipy.spatial.distance import cdist
val_out = 1 - cdist(array1, array2, 'cosine')
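Note that the 'cosine' metric in cdist returns the cosine distance, which is why it is subtracted from 1 to get the similarity. As a rough sketch of what each entry of val_out represents, here is a plain loop-based version. The name cosine_similarity_loop is hypothetical (it is not part of SciPy), and it is only meant as a reference for checking results on small inputs, not as a fast implementation.

```python
import numpy as np

def cosine_similarity_loop(array1, array2):
    # Hypothetical reference helper: pairwise cosine similarity, one pair at a time.
    out = np.empty((array1.shape[0], array2.shape[0]))
    for i, x in enumerate(array1):
        for j, y in enumerate(array2):
            # Dot product divided by the product of the two Euclidean norms.
            out[i, j] = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return out

a = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([[1.0, 0.0], [0.0, 2.0]])
sim = cosine_similarity_loop(a, b)   # shape (2, 2): rows of a against rows of b
```

Entry [i, j] of the result compares row i of the first array with row j of the second, which should match the layout of val_out above.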
Approach #2: Using matrix-multiplication as another approach -
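import numpy as np

def cosine_vectorized(array1, array2):
    # Squared norms: one per row of array2, and one per row of array1 as a column
    sumyy = (array2**2).sum(1)
    sumxx = (array1**2).sum(1, keepdims=True)
    # All pairwise dot products in a single matrix multiplication
    sumxy = array1.dot(array2.T)
    return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy)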
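The function above relies on NumPy broadcasting to apply the two normalizations. A small sketch with made-up sizes (4 and 2 rows, chosen only for illustration) shows the shapes involved:

```python
import numpy as np

array1 = np.random.rand(4, 3)   # 4 row vectors
array2 = np.random.rand(2, 3)   # 2 row vectors

sumxx = (array1**2).sum(1, keepdims=True)  # shape (4, 1): squared norm per row of array1
sumyy = (array2**2).sum(1)                 # shape (2,): squared norm per row of array2
sumxy = array1.dot(array2.T)               # shape (4, 2): all pairwise dot products

# (4, 2) / (4, 1) scales each row; dividing the result by (2,) scales each column.
result = (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)
```

Keeping sumxx as a column (via keepdims) is what lets it line up with the rows of sumxy, while the 1D sumyy lines up with its columns.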
Approach #3: Using np.einsum to compute the self squared summations for another one -
def cosine_vectorized_v2(array1, array2):
    sumyy = np.einsum('ij,ij->i',array2,array2)
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None]
    sumxy = array1.dot(array2.T)
    return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy)
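The subscript string 'ij,ij->i' multiplies the two inputs elementwise and sums over the second axis, so with the same array passed twice it gives the per-row sum of squares. A tiny check on a made-up array, as a sketch:

```python
import numpy as np

a = np.arange(6, dtype=float).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

row_sq_einsum = np.einsum('ij,ij->i', a, a)   # multiply elementwise, sum over j
row_sq_plain = (a**2).sum(1)                  # same values, but builds a temporary a**2
```

Both give the same numbers; the einsum form skips the intermediate squared array, which is presumably where the small gain over Approach #2 comes from.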
Approach #4: Bringing in the numexpr module to offload the square-root computations for another method -
import numexpr as ne
def cosine_vectorized_v3(array1, array2):
    sumyy = np.einsum('ij,ij->i',array2,array2)
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None]
    sumxy = array1.dot(array2.T)
    sqrt_sumxx = ne.evaluate('sqrt(sumxx)')
    sqrt_sumyy = ne.evaluate('sqrt(sumyy)')
    return ne.evaluate('(sumxy/sqrt_sumxx)/sqrt_sumyy')
Runtime test
# Using same sizes as stated in the question
In [185]: array1 = np.random.rand(10000,100)
...: array2 = np.random.rand(5000,100)
...:
In [194]: %timeit 1 - cdist(array1, array2, 'cosine')
1 loops, best of 3: 366 ms per loop
In [195]: %timeit cosine_vectorized(array1, array2)
1 loops, best of 3: 287 ms per loop
In [196]: %timeit cosine_vectorized_v2(array1, array2)
1 loops, best of 3: 283 ms per loop
In [197]: %timeit cosine_vectorized_v3(array1, array2)
1 loops, best of 3: 217 ms per loop
'fn' computes the cosine similarity between two vectors. I have updated the question.