How to use long user IDs in PySpark ALS

I am trying to use long user/product IDs in the ALS model in PySpark MLlib (1.3.1) and have run into a problem. A simplified version of the code is given here:

from pyspark import SparkContext 
from pyspark.mllib.recommendation import ALS, Rating 

sc = SparkContext("","test") 

# Load and parse the data 
d = [ "3661636574,1,1","3661636574,2,2","3661636574,3,3"] 
data = sc.parallelize(d) 
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(long(l[0]), long(l[1]), float(l[2]))) 

# Build the recommendation model using Alternating Least Squares 
rank = 10 
numIterations = 20 
model = ALS.train(ratings, rank, numIterations)

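Running this code produces a java.lang.ClassCastException because the code tries to convert longs to integers. Looking through the source code, the ml ALS class in Spark allows long user/product IDs, but the mllib ALS class forces the use of ints.

Question: is there a workaround for using long user/product IDs in PySpark ALS?

Source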

2015-05-19 Jon

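This is a known issue (https://issues.apache.org/jira/browse/SPARK-2465), but it will not be resolved soon, because changing the interface to a long userId is expected to slow down computation.

There are a few workarounds:

You can hash the user ID to an int with the hash() function. A collision only merges the rows of two users at random, and only in a few cases, so collisions should not really affect the accuracy of your recommender. This is discussed in the first link (the JIRA issue above).
You can generate unique int user IDs with RDD.zipWithUniqueId() or the slower RDD.zipWithIndex(), as in this thread: How to assign unique contiguous numbers to elements in a Spark RDD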
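
To make the two workarounds concrete, here is a minimal plain-Python sketch of the ID mappings themselves, without Spark. The helper names (to_int_id, parse_rating, build_index) are hypothetical and do not come from the answer. One assumption to note: on common 64-bit CPython builds, the built-in hash() of an integer this size returns the integer itself, so the sketch folds the ID with a modulo instead of relying on hash() alone. It also uses int() rather than long(), so it runs on Python 3 as well.

```python
# Plain-Python sketch of the two ID mappings; all names are illustrative.

INT_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def to_int_id(long_id):
    """Option 1: fold a non-negative integer ID into [0, INT_MAX).

    Different long IDs can land on the same int (a collision).
    """
    return long_id % INT_MAX


def parse_rating(line):
    """Parse 'user,product,rating' into a tuple with int-sized IDs."""
    user, product, rating = line.split(",")
    return (to_int_id(int(user)), to_int_id(int(product)), float(rating))


def build_index(ids):
    """Option 2: give each distinct ID a consecutive int, in first-seen order.

    In Spark the analogous step would be rdd.distinct().zipWithIndex(),
    followed by a join back onto the ratings.
    """
    index = {}
    for raw_id in ids:
        if raw_id not in index:
            index[raw_id] = len(index)
    return index


lines = ["3661636574,1,1", "3661636574,2,2", "3661636574,3,3"]
folded = [parse_rating(l) for l in lines]
user_index = build_index(int(l.split(",")[0]) for l in lines)
print(folded[0])
print(user_index)
```

In the question's pipeline, the tuple returned by parse_rating could be unpacked into Rating(...) inside the map step. With the index approach, keep the mapping around so that recommendations can be translated back to the original long IDs.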

Source

2015-06-10 18:35:50
