This is in PySpark; here goes:
Create the DF:
from pyspark.sql import SQLContext  # import needed if not already in scope

sqlContext = SQLContext(sc)  # sc is the existing SparkContext
data = [(1, 'animation'), (1, 'pixar'), (1, 'animation'), (2, 'comedy')]
rdd = sc.parallelize(data)
orders_df = sqlContext.createDataFrame(rdd, ["movieid", "tag"])
orders_df.show()
+-------+---------+
|movieid| tag|
+-------+---------+
| 1|animation|
| 1| pixar|
| 1|animation|
| 2| comedy|
+-------+---------+
Calculations:
orders_df.groupBy(['movieid', 'tag']).count().show()  # for each movie id, count how many times each tag is applied
+-------+---------+-----+
|movieid| tag|count|
+-------+---------+-----+
| 1| pixar| 1|
| 1|animation| 2|
| 2| comedy| 1|
+-------+---------+-----+
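Side note: the row order you get from show() after a groupBy is not guaranteed. If you want deterministic output, a small sketch (my addition, not part of the original answer) appends an orderBy:

orders_df.groupBy(['movieid', 'tag']).count().orderBy('movieid', 'tag').show()  # sorted for stable display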
orders_df.groupBy(['movieid']).count().show()  # number of tags applied to each movie
+-------+-----+
|movieid|count|
+-------+-----+
| 1| 3|
| 2| 1|
+-------+-----+
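Note that count() above counts tag applications, not distinct tags (movie 1 has three applications but only two distinct tags). Below is a minimal sketch of the distinct variant, also using the Spark 2.x+ SparkSession entry point that replaces the legacy SQLContext; the app name and column alias are my own assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

# Spark 2.x+ entry point; replaces SQLContext in newer versions.
spark = SparkSession.builder.appName("tag-counts").getOrCreate()

data = [(1, 'animation'), (1, 'pixar'), (1, 'animation'), (2, 'comedy')]
orders_df = spark.createDataFrame(data, ["movieid", "tag"])

# Distinct tags per movie: movie 1 -> 2, movie 2 -> 1.
orders_df.groupBy('movieid').agg(countDistinct('tag').alias('distinct_tags')).show()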
Thanks for the help, it works fine. The equivalent Scala code is orders_df.groupBy("movieid", "tag").count().show() – sasmita