2013-07-13
33

Given a list of sparse matrix entries, what is the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times. In other words, what is the fastest way in Python to compute cosine similarity given sparse matrix data?

Say the input matrix is:

A= 
[0 1 0 0 1 
0 0 1 1 1 
1 1 0 1 0] 

Its sparse representation is:

A = 
0, 1 
0, 4 
1, 2 
1, 3 
1, 4 
2, 0 
2, 1 
2, 3 
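As an aside, the (row, col) listing above can be loaded directly into a `scipy.sparse` COO matrix. A minimal sketch (the pair list is copied from the listing; all nonzero values are assumed to be 1):

```python
import numpy as np
from scipy import sparse

# (row, col) pairs of the nonzero entries from the listing above
pairs = np.array([[0, 1], [0, 4], [1, 2], [1, 3], [1, 4],
                  [2, 0], [2, 1], [2, 3]])
data = np.ones(len(pairs))  # every stored entry is a 1

# build in COO format, then convert to CSR for fast arithmetic
A = sparse.coo_matrix((data, (pairs[:, 0], pairs[:, 1])),
                      shape=(3, 5)).tocsr()

print(A.toarray())  # recovers the dense matrix shown above
```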

In Python, it is straightforward to work with the dense matrix input format:

import numpy as np 
from sklearn.metrics import pairwise_distances 
from scipy.spatial.distance import cosine 

A = np.array(
[[0, 1, 0, 0, 1], 
[0, 0, 1, 1, 1], 
[1, 1, 0, 1, 0]]) 

dist_out = 1-pairwise_distances(A, metric="cosine") 
dist_out 

which gives:

array([[ 1.  , 0.40824829, 0.40824829], 
     [ 0.40824829, 1.  , 0.33333333], 
     [ 0.40824829, 0.33333333, 1.  ]]) 

This is fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.

+0

Shouldn't the first line of the sparse A be '0, 1'? – seth

+0

How large is A typically? – seth

+0

Seth, yes, I've fixed it. Thanks. The size is currently in the tens of thousands of nonzero entries, but I'd like to handle matrices 2-3 orders of magnitude larger. – zbinsd

Answers

24

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

import numpy as np 
from sklearn.metrics.pairwise import cosine_similarity 
from scipy import sparse 

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]]) 
A_sparse = sparse.csr_matrix(A) 

similarities = cosine_similarity(A_sparse) 
print('pairwise dense output:\n {}\n'.format(similarities)) 

#also can output sparse matrices 
similarities_sparse = cosine_similarity(A_sparse, dense_output=False) 
print('pairwise sparse output:\n {}\n'.format(similarities_sparse)) 

Results:

pairwise dense output: 
[[ 1.   0.40824829 0.40824829] 
[ 0.40824829 1.   0.33333333] 
[ 0.40824829 0.33333333 1.  ]] 

pairwise sparse output: 
(0, 1) 0.408248290464 
(0, 2) 0.408248290464 
(0, 0) 1.0 
(1, 0) 0.408248290464 
(1, 2) 0.333333333333 
(1, 1) 1.0 
(2, 1) 0.333333333333 
(2, 0) 0.408248290464 
(2, 2) 1.0 

If you want column-wise cosine similarities, simply transpose your input matrix beforehand:

A_sparse.transpose() 
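Putting the two pieces together, a column-wise version of the call above might look like this (a sketch reusing the same 3×5 example matrix):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# transpose first, so the rows fed to cosine_similarity
# are the columns of the original matrix
col_similarities = cosine_similarity(A_sparse.transpose())
print(col_similarities.shape)  # one entry per pair of columns: (5, 5)
```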
1

You should check out scipy.sparse (link). You can apply operations on those sparse matrices just like you would use an ordinary matrix.

+2

'scipy.sparse' does not support this operation. – Medeiros

32

The following method is about 30 times faster than scipy.spatial.distance.pdist. It works quite quickly on large matrices (assuming you have enough RAM).

See below for a discussion of how to optimize for sparsity.

import numpy 

# base similarity matrix (all dot products) 
# replace this with A.dot(A.T).toarray() for a sparse representation 
similarity = numpy.dot(A, A.T) 


# squared magnitude of preference vectors (number of occurrences) 
square_mag = numpy.diag(similarity) 

# inverse squared magnitude 
inv_square_mag = 1/square_mag 

# if it doesn't occur, set its inverse magnitude to zero (instead of inf) 
inv_square_mag[numpy.isinf(inv_square_mag)] = 0 

# inverse of the magnitude 
inv_mag = numpy.sqrt(inv_square_mag) 

# cosine similarity (elementwise multiply by inverse magnitudes) 
cosine = similarity * inv_mag 
cosine = cosine.T * inv_mag 

If your problem is typical for large-scale binary preference problems, you have a lot more entries in one dimension than in the other. Also, the short dimension is the one whose entries you want to compute similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical, you'll need more modifications. These should be fairly straightforward substitutions of the scipy.sparse equivalents for the basic numpy operations.
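Following the substitution the answer describes, only the first line changes for a sparse input; a sketch of the full computation on a CSR matrix (using the question's example matrix):

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]], dtype=float))

# sparse dot product, densified as suggested above
similarity = A.dot(A.T).toarray()

# squared magnitudes sit on the diagonal of the Gram matrix
square_mag = np.diag(similarity)
inv_square_mag = 1 / square_mag
inv_square_mag[np.isinf(inv_square_mag)] = 0  # all-zero rows map to 0
inv_mag = np.sqrt(inv_square_mag)

# scale rows, then columns, by the inverse magnitudes
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag
```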

0

Hi, you can do it this way:

import numpy as np 
import scipy.sparse as sp 

# data, row, col describe your nonzero entries in COO format 
temp = sp.coo_matrix((data, (row, col)), shape=(3, 59)) 
temp1 = temp.tocsr() 

# Cosine similarity: normalize each row in place, then multiply 
row_sums = (temp1.multiply(temp1)).sum(axis=1) 
rows_sums_sqrt = np.array(np.sqrt(row_sums))[:, 0] 
row_indices, col_indices = temp1.nonzero() 
temp1.data /= rows_sums_sqrt[row_indices] 
temp2 = temp1.transpose() 
temp3 = temp1 * temp2 
2

I took all these answers and wrote a script to 1. validate each of the results (see the assertions below) and 2. see which is the fastest. The code and results are below:

# Imports 
import numpy as np 
import scipy.sparse as sp 
from scipy.spatial.distance import squareform, pdist 
from sklearn.metrics.pairwise import linear_kernel 
from sklearn.preprocessing import normalize 
from sklearn.metrics.pairwise import cosine_similarity 

# Create an adjacency matrix 
np.random.seed(42) 
A = np.random.randint(0, 2, (10000, 100)).astype(float).T 

# Make it sparse 
rows, cols = np.where(A) 
data = np.ones(len(rows)) 
Asp = sp.csr_matrix((data, (rows, cols)), shape=(rows.max()+1, cols.max()+1)) 

print("Input data shape:", Asp.shape) 

# Define a function to calculate the cosine similarities a few different ways 
def calc_sim(A, method=1): 
    if method == 1: 
        return 1 - squareform(pdist(A, metric='cosine')) 
    if method == 2: 
        Anorm = A/np.linalg.norm(A, axis=-1)[:, np.newaxis] 
        return np.dot(Anorm, Anorm.T) 
    if method == 3: 
        Anorm = A/np.linalg.norm(A, axis=-1)[:, np.newaxis] 
        return linear_kernel(Anorm) 
    if method == 4: 
        similarity = np.dot(A, A.T) 

        # squared magnitude of preference vectors (number of occurrences) 
        square_mag = np.diag(similarity) 

        # inverse squared magnitude 
        inv_square_mag = 1/square_mag 

        # if it doesn't occur, set its inverse magnitude to zero (instead of inf) 
        inv_square_mag[np.isinf(inv_square_mag)] = 0 

        # inverse of the magnitude 
        inv_mag = np.sqrt(inv_square_mag) 

        # cosine similarity (elementwise multiply by inverse magnitudes) 
        cosine = similarity * inv_mag 
        return cosine.T * inv_mag 
    if method == 5: 
        ''' 
        Just a version of method 4 that takes in sparse arrays 
        ''' 
        similarity = A*A.T 
        square_mag = np.array(A.sum(axis=1)) 

        # inverse squared magnitude 
        inv_square_mag = 1/square_mag 

        # if it doesn't occur, set its inverse magnitude to zero (instead of inf) 
        inv_square_mag[np.isinf(inv_square_mag)] = 0 

        # inverse of the magnitude 
        inv_mag = np.sqrt(inv_square_mag).T 

        # cosine similarity (elementwise multiply by inverse magnitudes) 
        cosine = np.array(similarity.multiply(inv_mag)) 
        return cosine * inv_mag.T 
    if method == 6: 
        return cosine_similarity(A) 

# Assert that all results are consistent with the first method ("truth") 
for m in range(1, 7): 
    if m in [5]: # The sparse case 
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m)) 
    else: 
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m)) 

# Time them: 
print("Method 1") 
%timeit calc_sim(A, method=1) 
print("Method 2") 
%timeit calc_sim(A, method=2) 
print("Method 3") 
%timeit calc_sim(A, method=3) 
print("Method 4") 
%timeit calc_sim(A, method=4) 
print("Method 5") 
%timeit calc_sim(Asp, method=5) 
print("Method 6") 
%timeit calc_sim(A, method=6) 

Results:

Input data shape: (100, 10000) 
Method 1 
10 loops, best of 3: 71.3 ms per loop 
Method 2 
100 loops, best of 3: 8.2 ms per loop 
Method 3 
100 loops, best of 3: 8.6 ms per loop 
Method 4 
100 loops, best of 3: 2.54 ms per loop 
Method 5 
10 loops, best of 3: 73.7 ms per loop 
Method 6 
10 loops, best of 3: 77.3 ms per loop 
5

I tried some of the methods above. However, the experiment by @zbinsd has its limitations. The sparsity of the matrix used in the experiment is extremely low, while real sparsity is usually over 90%. In my situation, the sparse matrix has shape (7000, 25000) and 97% sparsity. Method 4 was extremely slow and I couldn't bear waiting for the results. Method 6 finished in 10 seconds. Amazingly, I tried the method below and it finished in only 0.247 seconds.

import sklearn.preprocessing as pp 

def cosine_similarities(mat): 
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0) 
    return col_normed_mat.T * col_normed_mat 
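Note that this returns column-by-column similarities. As a quick sanity check, one might compare it against sklearn's `cosine_similarity` on the transposed input (a sketch using the question's example matrix):

```python
import numpy as np
import scipy.sparse as sp
import sklearn.preprocessing as pp
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarities(mat):
    # L2-normalize each column, then one sparse matrix product
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

A = sp.csr_matrix(np.array([[0, 1, 0, 0, 1],
                            [0, 0, 1, 1, 1],
                            [1, 1, 0, 1, 0]], dtype=float))

fast = cosine_similarities(A).toarray()
reference = cosine_similarity(A.T)  # column-wise similarity via transpose
print(np.allclose(fast, reference))  # → True
```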

This efficient method comes from enter link description here