How can I quickly compute cosine similarity for a large number of vectors in Python?

I have a set of 1 million vectors, and I need to retrieve the top 25 closest vectors based on cosine similarity.

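SciPy and scikit-learn have implementations for computing the cosine distance/similarity between two vectors, but I need to compute the cosine similarity for a 100K x 100K matrix and then take the top 25. Is there a fast implementation for this in Python?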

As @Silmathoron suggested, here is what I am doing at the moment:

import heapq

import numpy

# vectors is a list of 100K vectors, each of dimension 400 (shape 100K x 400)
vectors = numpy.array(vectors)
similarity = numpy.dot(vectors, vectors.T)

# squared magnitude of each vector (the diagonal of the dot-product matrix)
square_mag = numpy.diag(similarity)

# inverse squared magnitude; a zero vector gives inf here
with numpy.errstate(divide="ignore"):
    inv_square_mag = 1.0 / square_mag

# for zero vectors, set the inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

k = 26  # top 25 plus, presumably, the vector itself

# queries is not defined in this snippet; it is assumed to hold one
# string label per vector, in the same order as the rows of vectors
with open("box_data.csv", "w") as box_plot_file:
    for sim, query in zip(cosine, queries):
        k_largest = heapq.nlargest(k, sim)
        result = query + "," + ",".join(map(str, k_largest)) + "\n"
        box_plot_file.write(result)

Source

2016-06-25 user3667569

What do you mean by "the top 25 closest vectors"? The 25 closest pairs? Or something else? –

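For each vector, I compute the cosine similarity with every other vector and then select, for that vector, the 25 vectors with the highest cosine similarity. – user3667569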
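The per-vector selection described in this comment can be sketched without ever holding the full similarity matrix in memory. This is only an illustration under some assumptions: the function name `top_k_cosine` and the `chunk_size` parameter are made up for this example, rows are normalised to unit length so that a plain dot product equals the cosine similarity, and each vector is excluded from its own result list.

```python
import numpy as np


def top_k_cosine(vectors, k=25, chunk_size=1024):
    """Return (indices, similarities) of the k most similar rows for each row.

    Hypothetical helper: processes chunk_size rows at a time so that only a
    (chunk_size x n) block of similarities exists at any moment.
    Assumes there are at least two vectors.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # zero vectors stay zero, so their similarity is 0
    unit = vectors / norms

    n = unit.shape[0]
    k = min(k, n - 1)  # a vector cannot have more than n - 1 neighbours
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float64)

    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = unit[start:stop] @ unit.T  # cosine similarities for this chunk
        rows = np.arange(stop - start)
        sims[rows, np.arange(start, stop)] = -np.inf  # exclude self-matches

        # argpartition finds the k largest per row without a full sort
        part = np.argpartition(sims, -k, axis=1)[:, -k:]
        part_sims = sims[rows[:, None], part]

        # sort only those k candidates, highest similarity first
        order = np.argsort(-part_sims, axis=1)
        top_idx[start:stop] = part[rows[:, None], order]
        top_sim[start:stop] = part_sims[rows[:, None], order]

    return top_idx, top_sim
```

Calling `top_k_cosine(vectors, k=25)` would return two arrays of shape (n, 25). The work is still quadratic in the number of vectors; the chunking only bounds memory use, and `chunk_size` trades memory for fewer, larger matrix products.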

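It depends on how fast you need it to be... If you show us an example implementation that takes too long (on a subsample, if the full run is really too slow) and tell us the speed-up you are hoping for, we can tell you whether a better algorithm would speed things up in Python or whether you need to move to Cython or multithreading... – Silmathoron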

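I would try a smarter algorithm rather than speeding up brute force (computing all pairs of vectors). If your vectors are low-dimensional, KD-trees may work: scipy.spatial.KDTree(). If they are high-dimensional, you may first need a random projection: http://scikit-learn.org/stable/modules/random_projection.html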
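A minimal sketch of the KD-tree idea follows, under a few assumptions. For unit-length vectors a and b, ||a - b||^2 = 2 - 2 cos(a, b), so the nearest Euclidean neighbours of normalised vectors are also the most cosine-similar ones. The helper names `random_project` and `top_k_cosine_kdtree` are made up for this example; it uses `scipy.spatial.cKDTree` (the compiled counterpart of `KDTree`), and the projection is a plain NumPy Gaussian random projection rather than the scikit-learn module linked above. It also assumes no zero vectors, no exact duplicates, and more than k vectors.

```python
import numpy as np
from scipy.spatial import cKDTree


def random_project(vectors, n_components, seed=0):
    """Hypothetical helper: Gaussian random projection to n_components dims.

    Distances (and therefore neighbour rankings) are only approximately
    preserved, so results after projection are approximate.
    """
    rng = np.random.default_rng(seed)
    n_features = vectors.shape[1]
    projection = rng.normal(
        scale=1.0 / np.sqrt(n_components), size=(n_features, n_components)
    )
    return vectors @ projection


def top_k_cosine_kdtree(vectors, k=25):
    """Return (indices, cosine similarities) of the k nearest rows per row."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # normalise to unit length so Euclidean order matches cosine order
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    tree = cKDTree(unit)
    # ask for k + 1 neighbours: the closest hit is the query point itself
    dist, idx = tree.query(unit, k=k + 1)

    # convert Euclidean distance between unit vectors back to cosine
    cosine = 1.0 - 0.5 * dist ** 2
    return idx[:, 1:], cosine[:, 1:]
```

For the 400-dimensional vectors in the question, one possible use is `top_k_cosine_kdtree(random_project(vectors, 16), k=25)`, where 16 is an arbitrary illustrative target dimension. KD-trees tend to lose their advantage as dimensionality grows, which is why the projection step comes first, at the cost of exactness.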

Source

2016-06-26 03:11:14 ericf
