Numpy cosine similarity difference over big collections

I need to use scikit-learn's sklearn.metrics.pairwise.cosine_similarity on large matrices. As an optimization I only need to compute some rows of the result, so I tried several approaches.

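I found that in some cases the results differ depending on the size of the vectors. I see this strange behaviour in the following test case (large vectors, transposed, then cosine similarity computed):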

from sklearn.metrics.pairwise import cosine_similarity 
from scipy import spatial 
import numpy as np 
from scipy.sparse import csc_matrix 

size=200 
a=np.array([[1,0,1,0]]*size) 
sparse_a=csc_matrix(a.T) 
#standard cosine similarity between the whole transposed matrix, take only the first row 
res1=cosine_similarity(a.T,a.T)[0] 
#compute only the first row: the first row of the transposed matrix against the whole transposed matrix
res2=cosine_similarity([a.T[0]],a.T)[0] 
#sparse matrix implementation with the transposed, which should be faster 
res3=cosine_similarity(sparse_a,sparse_a)[0] 
print("res1: ",res1) 
print("res2: ",res2) 
print("res3: ",res3) 
print("res1 vs res2: ",res1==res2) 
print("res1 vs res3: ",res1==res3) 
print("res2 vs res3: ", res2==res3)

With "size" set to 200, as in the code above, I get this result, which is fine:

res1: [ 1. 0. 1. 0.] 
res2: [ 1. 0. 1. 0.] 
res3: [ 1. 0. 1. 0.] 
res1 vs res2: [ True True True True] 
res1 vs res3: [ True True True True] 
res2 vs res3: [ True True True True]

But with a larger "size" (the answer below uses 2000), something strange happens:

res1: [ 1. 0. 1. 0.] 
res2: [ 1. 0. 1. 0.] 
res3: [ 1. 0. 1. 0.] 
res1 vs res2: [False True False True] 
res1 vs res3: [False True False True] 
res2 vs res3: [ True True True True]

Does anyone know what I am missing?

Thanks in advance.

Source

2016-09-07 Valerio Storch

To compare floating-point numpy arrays, use np.isclose rather than the equality operator. Try:

from sklearn.metrics.pairwise import cosine_similarity 
from scipy import spatial 
import numpy as np 
from scipy.sparse import csc_matrix 

size=2000 
a=np.array([[1,0,1,0]]*size) 
sparse_a=csc_matrix(a.T) 
#standard cosine similarity between the whole transposed matrix, take only the first row 
res1=cosine_similarity(a.T,a.T)[0] 
#compute only the first row: the first row of the transposed matrix against the whole transposed matrix
res2=cosine_similarity([a.T[0]],a.T)[0] 
#sparse matrix implementation with the transposed, which should be faster 
res3=cosine_similarity(sparse_a,sparse_a)[0] 
print("res1: ",res1) 
print("res2: ",res2) 
print("res3: ",res3) 
print("res1 vs res2: ", np.isclose(res1, res2)) 
print("res1 vs res3: ", np.isclose(res1, res3)) 
print("res2 vs res3: ", np.isclose(res2, res3))

The result is:

res1: [ 1. 0. 1. 0.] 
res2: [ 1. 0. 1. 0.] 
res3: [ 1. 0. 1. 0.] 
res1 vs res2: [ True True True True] 
res1 vs res3: [ True True True True] 
res2 vs res3: [ True True True True]

as expected.
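For reference, NumPy has a few tolerance-based comparisons besides np.isclose. The sketch below uses made-up arrays (`expected` and `computed` are illustrative names and values, not taken from the question) to show the elementwise check, the single-boolean form, and the assertion variant that is handy in tests.

```python
import numpy as np

# Illustrative values: `computed` differs from `expected` only in the last bit.
expected = np.array([1.0, 0.0, 1.0, 0.0])
computed = np.array([1.0000000000000002, 0.0, 0.9999999999999999, 0.0])

# Exact elementwise comparison: False wherever the last bits differ.
exact_match = expected == computed

# Elementwise comparison within a tolerance (defaults: rtol=1e-05, atol=1e-08).
close_match = np.isclose(expected, computed)

# One boolean for the whole array.
all_close = np.allclose(expected, computed)

# Raises AssertionError with a diagnostic message if the arrays differ too much.
np.testing.assert_allclose(computed, expected, rtol=1e-12, atol=1e-12)

print(exact_match, close_match, all_close)
```

The tolerances passed to assert_allclose here are an arbitrary choice for the illustration; pick values that match the precision your application actually needs.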

Source

2016-09-07 12:34:27

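Thank you very much for your answer. I ran it and it works. However, according to the documentation, np.isclose() *"returns a boolean array where two arrays are element-wise equal within a tolerance."* This seems to confirm that the values in the matrices are not exactly the same (they are only close to each other within the tolerance). The point of my question is **why cosine_similarity returns different values in different cases**.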
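One plausible explanation, offered here as a sketch rather than a statement about scikit-learn's internals: floating-point arithmetic is not associative, so two algebraically identical computations can differ in the last bits when the operations are grouped or ordered differently. The thread does not establish which code paths are taken for the dense, single-row, and sparse inputs. The example below shows the effect with plain floats and with a hand-rolled cosine of a vector with itself (the variable names are illustrative).

```python
import numpy as np

# Floating-point addition is not associative: grouping changes the last bits.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(repr(left), repr(right))

# Two algebraically identical ways to compute the cosine of a vector with itself.
v = np.ones(2000)
norm = np.sqrt(np.dot(v, v))
normalize_first = np.dot(v / norm, v / norm)   # normalise, then multiply
normalize_last = np.dot(v, v) / (norm * norm)  # multiply, then normalise

# Both equal 1 up to rounding. Whether they match bit for bit depends on the
# platform and on the order of the underlying operations.
print(float(normalize_first), float(normalize_last))
print(np.isclose(normalize_first, normalize_last))
```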

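'cosine_similarity' always prints '[1. 0. 1. 0.]' here; any differences between the cases are floating-point rounding below the printed precision. The problem is the way the results are compared: floating-point 'numpy.array' values should not be compared with '=='.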
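A small sketch of how two arrays can print identically and still fail '==': default array printing shows only 8 decimal digits, so a difference in the last bit is invisible. The `rounded` array below is a hypothetical stand-in for a result that is off by one unit in the last place. It is not the actual output of cosine_similarity.

```python
import numpy as np

exact = np.array([1.0, 0.0, 1.0, 0.0])

# Illustrative stand-in: the next representable float above 1.0.
one_ulp_up = np.nextafter(1.0, 2.0)
rounded = np.array([one_ulp_up, 0.0, one_ulp_up, 0.0])

print(exact)             # default printing rounds to 8 decimal digits
print(rounded)           # prints the same text as `exact`
print(exact == rounded)  # False wherever the last bit differs

# Full-precision views reveal the discrepancy and its size.
print(rounded.tolist())
print(np.abs(exact - rounded).max())
```

Applying the same two inspection lines to res1 and res2 from the question would show how large the real discrepancy is on a given machine.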
