2016-03-28 60 views
2

我正在使用Spark的推薦系統。PySpark - RDD到ALS輸出中的數據框

列車模型之後,我做下面的代碼來獲得建議 model.recommendProductsForUsers(2)

[(10000, (Rating(user=10000, product=14780773, rating=7.35695469892999e-05), 
      Rating(user=10000, product=17229476, rating=5.648606256948921e-05))), 
(0, (Rating(user=0, product=16750010, rating=0.04405213492474741), 
     Rating(user=0, product=17416511, rating=0.019491942665715176))), 
(20000, (Rating(user=20000, product=17433348, rating=0.017938298063142653), 
      Rating(user=20000, product=17333969, rating=0.01505112418739887)))] 

在這種情況下RecRDD見下文。

>>> type(Rec) 
<class 'pyspark.rdd.RDD'> 

我怎樣才能把這些信息在數據幀像

User | Product | Rating 
1000 | 14780773 | 7.3e-05 
1000 | 17229675 | 5.6e-05 
(...)  (...)  (...) 
2000 | 17333969 | 0.015  

感謝您的時間

+1

必要的功能[覆蓋在PySpark文檔](https://spark.apache.org/docs/1.5.2/api/python /pyspark.sql.html)。查找'createDataFrame'。 –

回答

3

爲了驗證,我用下面的pyspark代碼重現您RDD

from pyspark.mllib.recommendation import Rating 

Rec = sc.parallelize([(10000, (Rating(user=10000, product=14780773, rating=7.35695469892999e-05), 
           Rating(user=10000, product=17229476, rating=5.648606256948921e-05))), 
         (0, (Rating(user=0, product=16750010, rating=0.04405213492474741), 
          Rating(user=0, product=17416511, rating=0.019491942665715176))), 
         (20000, (Rating(user=20000, product=17433348, rating=0.017938298063142653), 
           Rating(user=20000, product=17333969, rating=0.01505112418739887)))]) 

此RDD由鍵值對組成,每個值由記錄w ith評級元組。您需要映射RDD以僅保留記錄,然後將結果分解爲每個建議都有單獨的元組。該flatMap(f)功能就會凝結,像這樣兩個步驟:

flatRec = Rec.flatMap(lambda p: p[1]) 

這會導致RDD形式:

[Rating(user=10000, product=14780773, rating=7.35695469892999e-05), 
Rating(user=10000, product=17229476, rating=5.648606256948921e-05), 
Rating(user=0, product=16750010, rating=0.04405213492474741), 
Rating(user=0, product=17416511, rating=0.019491942665715176), 
Rating(user=20000, product=17433348, rating=0.017938298063142653), 
Rating(user=20000, product=17333969, rating=0.01505112418739887)] 

現在所需要的是使用createDataFrame功能把它變成一個數據幀。每個評級元組將變成一個DataFrame行,並且由於這些項目被標記,您不需要指定一個模式。

recDF = sqlContext.createDataFrame(flatRec).show() 

這將輸出以下內容:

+-----+--------+--------------------+ 
| user| product|    rating| 
+-----+--------+--------------------+ 
|10000|14780773| 7.35695469892999E-5| 
|10000|17229476|5.648606256948921E-5| 
| 0|16750010| 0.04405213492474741| 
| 0|17416511|0.019491942665715176| 
|20000|17433348|0.017938298063142653| 
|20000|17333969| 0.01505112418739887| 
+-----+--------+--------------------+ 
+1

安德里亞,非常好:)非常感謝您的解釋。 – Kardu