Spark pivot on strings in PySpark
I have a problem restructuring data with Spark. The original data looks like this:
df = sqlContext.createDataFrame([
("ID_1", "VAR_1", "Butter"),
("ID_1", "VAR_2", "Toast"),
("ID_1", "VAR_3", "Ham"),
("ID_2", "VAR_1", "Jam"),
("ID_2", "VAR_2", "Toast"),
("ID_2", "VAR_3", "Egg"),
], ["ID", "VAR", "VAL"])
>>> df.show()
+----+-----+------+
|  ID|  VAR|   VAL|
+----+-----+------+
|ID_1|VAR_1|Butter|
|ID_1|VAR_2| Toast|
|ID_1|VAR_3|   Ham|
|ID_2|VAR_1|   Jam|
|ID_2|VAR_2| Toast|
|ID_2|VAR_3|   Egg|
+----+-----+------+
This is the structure I am trying to achieve:
+----+------+-----+-----+
|  ID| VAR_1|VAR_2|VAR_3|
+----+------+-----+-----+
|ID_1|Butter|Toast|  Ham|
|ID_2|   Jam|Toast|  Egg|
+----+------+-----+-----+
My idea was to use:
df.groupBy("ID").pivot("VAR").show()
But I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'show'
Any suggestions? Thanks!
This may be a valid answer, so why the downvote? – eliasah