Apache Spark Python: cosine similarity over DataFrames

For a recommender system, I need to compute the cosine similarity between all the columns of an entire Spark DataFrame.

In pandas I used to do this:

import sklearn.metrics as metrics 
import pandas as pd 
df= pd.DataFrame(...some dataframe over here :D ...) 
metrics.pairwise.cosine_similarity(df.T,df.T)

This generates the similarity matrix between the columns (because I pass in the transpose).
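For reference, the same computation can be written out with plain NumPy. This is only a sketch to make explicit what is being computed; the helper name `column_cosine_similarity` and the sample values are made up for illustration and are not part of the original snippet.

```python
import numpy as np

def column_cosine_similarity(X):
    """Cosine similarity between every pair of columns of a 2-D array."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)   # Euclidean norm of each column
    normalized = X / norms              # assumes no column is all zeros
    return normalized.T @ normalized    # (n_cols, n_cols) similarity matrix

# Illustrative data: 4 rows, 3 columns (column 2 is twice column 0)
X = np.array([[1, 0, 2],
              [0, 1, 0],
              [3, 0, 6],
              [0, 2, 0]])
sim = column_cosine_similarity(X)
```

The result is symmetric with ones on the diagonal, so only the entries above the diagonal carry information.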

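Is there any way to do the same thing in Spark (Python)?

(I need to apply this to a matrix with millions of rows and thousands of columns, which is why I need to do it in Spark.)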


2017-05-11 Valerio Storch

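You can use the built-in columnSimilarities() method on a RowMatrix. It can either compute the exact cosine similarities or estimate them using the DIMSUM method, which will be considerably faster for larger datasets. The difference in usage is that for the latter you have to specify a threshold.

Here is a small reproducible example: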

from pyspark.mllib.linalg.distributed import RowMatrix 

# Assumes an active SparkContext named sc (as in the pyspark shell) 
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]) 

# Convert to RowMatrix 
mat = RowMatrix(rows) 

# Calculate exact and approximate similarities 
exact = mat.columnSimilarities() 
approx = mat.columnSimilarities(0.05) 

# Output 
exact.entries.collect() 
[MatrixEntry(0, 2, 0.991935352214), 
MatrixEntry(1, 2, 0.998441152599), 
MatrixEntry(0, 1, 0.997463284056)]
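Each MatrixEntry(i, j, value) holds the cosine similarity between column i and column j. Only the upper triangle (i < j) is returned, since the full matrix is symmetric and its diagonal is 1. The sketch below uses plain NumPy rather than Spark to rebuild the full matrix from the three entries above and compare it against a direct computation; the variable names are illustrative.

```python
import numpy as np

# Same data as the RowMatrix above: 4 rows, 3 columns
data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)], dtype=float)

# (i, j, value) triples copied from the output above
entries = [(0, 2, 0.991935352214),
           (1, 2, 0.998441152599),
           (0, 1, 0.997463284056)]

n = data.shape[1]
sim = np.eye(n)                 # a column is perfectly similar to itself
for i, j, value in entries:
    sim[i, j] = value
    sim[j, i] = value           # mirror into the lower triangle

# Direct computation for comparison: normalize columns, then take dot products
unit = data / np.linalg.norm(data, axis=0)
expected = unit.T @ unit
matches = np.allclose(sim, expected, atol=1e-6)
```

So in MatrixEntry(0, 2, 0.9919...), 0 and 2 are column indices and the last field is their similarity.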


2017-05-11 17:46:42 mtoto

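How would I do this for rows instead of columns? – Charleslmh

@mtoto Do you know how to implement the same thing in Scala? https://stackoverflow.com/questions/47010126/calculate-cosine-similarity-spark-dataframe

Can you explain the MatrixEntry results? For example, what are the 0 and the 2?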
