
PySpark count common occurrences

After doing a market basket analysis and extracting the rules, ... I would also like to count the common occurrences of items, as tuples, in order to visualize them in Tableau. Below you can find the items per ID/basket member.

df = sqlContext.createDataFrame([("ID_1", "Butter"),
                                 ("ID_1", "Toast"),
                                 ("ID_1", "Ham"),
                                 ("ID_2", "Ham"),
                                 ("ID_2", "Toast"),
                                 ("ID_2", "Egg")],
                                ["ID", "VAL"])

df.show() 

+----+------+ 
| ID| VAL| 
+----+------+ 
|ID_1|Butter| 
|ID_1| Toast| 
|ID_1| Ham| 
|ID_2| Ham| 
|ID_2| Toast| 
|ID_2| Egg| 
+----+------+ 

This is what I would like to achieve:

res = sqlContext.createDataFrame([("Butter", "Butter", 0),
                                  ("Butter", "Toast", 1),
                                  ("Butter", "Ham", 1),
                                  ("Butter", "Egg", 0),
                                  ("Toast", "Toast", 0),
                                  ("Toast", "Ham", 2),
                                  ("Toast", "Egg", 1),
                                  ("Ham", "Ham", 0),
                                  ("Ham", "Egg", 0),
                                  ("Egg", "Egg", 0)],
                                 ["VAL_1", "VAL_2", "COUNT"])

res.show() 

+------+------+-----+ 
| VAL_1| VAL_2|COUNT| 
+------+------+-----+ 
|Butter|Butter| 0| 
|Butter| Toast| 1| 
|Butter| Ham| 1| 
|Butter| Egg| 0| 
| Toast| Toast| 0| 
| Toast| Ham| 2| 
| Toast| Egg| 1| 
| Ham| Ham| 0| 
| Ham| Egg| 0| 
| Egg| Egg| 0| 
+------+------+-----+ 

Your help is very much appreciated! Thanks!

Answer


Try the below; you may also want to rename the computed column with withColumnRenamed.

df.groupBy(['ID','VAL']).count().show()
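
As a minimal sketch (assuming the df from the question and the same sqlContext session): the first line below renames the computed column as suggested, with "CNT" as an example name. The self-join that follows is my own assumption, not part of the original answer, and is one possible way to get the pairwise co-occurrence counts the question asks for; it only emits pairs that actually occur together, so the zero-count rows from the expected output would need an extra cross join of the distinct items.

from pyspark.sql import functions as F

# rename the computed "count" column (the new name "CNT" is just an example)
df.groupBy(['ID', 'VAL']).count().withColumnRenamed('count', 'CNT').show()

# assumption, not the original answer: self-join on the basket ID to count item pairs
pairs = (df.alias('a')
         .join(df.alias('b'), on='ID')            # items sharing the same basket
         .where(F.col('a.VAL') < F.col('b.VAL'))  # keep each unordered pair once
         .groupBy(F.col('a.VAL').alias('VAL_1'),
                  F.col('b.VAL').alias('VAL_2'))
         .count()
         .withColumnRenamed('count', 'COUNT'))
pairs.show()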