2016-11-05 24 views

Spark Pivot String in PySpark

I have a problem restructuring data with Spark. The original data look like this:

df = sqlContext.createDataFrame([ 
    ("ID_1", "VAR_1", "Butter"), 
    ("ID_1", "VAR_2", "Toast"), 
    ("ID_1", "VAR_3", "Ham"), 
    ("ID_2", "VAR_1", "Jam"), 
    ("ID_2", "VAR_2", "Toast"), 
    ("ID_2", "VAR_3", "Egg"), 
], ["ID", "VAR", "VAL"]) 

>>> df.show() 
+----+-----+------+ 
| ID| VAR| VAL| 
+----+-----+------+ 
|ID_1|VAR_1|Butter| 
|ID_1|VAR_2| Toast| 
|ID_1|VAR_3| Ham| 
|ID_2|VAR_1| Jam| 
|ID_2|VAR_2| Toast| 
|ID_2|VAR_3| Egg| 
+----+-----+------+ 

This is the structure I am trying to get to:

+----+------+-----+-----+ 
| ID| VAR_1|VAR_2|VAR_3| 
+----+------+-----+-----+ 
|ID_1|Butter|Toast| Ham| 
|ID_2| Jam|Toast| Egg| 
+----+------+-----+-----+ 

My idea was to use:

df.groupBy("ID").pivot("VAR").show() 

But I get the following error:

Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
AttributeError: 'GroupedData' object has no attribute 'show' 

Any suggestions? Thanks!

Answer


You need to add an aggregation after pivot(). If you are sure there is only one VAL for each (ID, VAR) pair, you can use first():

from pyspark.sql import functions as f 

result = df.groupBy("ID").pivot("VAR").agg(f.first("VAL")) 
result.show() 

+----+------+-----+-----+ 
| ID| VAR_1|VAR_2|VAR_3| 
+----+------+-----+-----+ 
|ID_1|Butter|Toast| Ham| 
|ID_2| Jam|Toast| Egg| 
+----+------+-----+-----+ 
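For intuition, the pivot-plus-first() logic above can be sketched in plain Python (no Spark needed). This is only an illustration of what the transformation computes, with the pivoted table modeled as a nested dict instead of a DataFrame:

```python
# The same rows as in df
rows = [
    ("ID_1", "VAR_1", "Butter"),
    ("ID_1", "VAR_2", "Toast"),
    ("ID_1", "VAR_3", "Ham"),
    ("ID_2", "VAR_1", "Jam"),
    ("ID_2", "VAR_2", "Toast"),
    ("ID_2", "VAR_3", "Egg"),
]

pivoted = {}
for id_, var, val in rows:
    # setdefault keeps the first VAL seen per (ID, VAR) pair,
    # mirroring f.first("VAL") in the Spark aggregation
    pivoted.setdefault(id_, {}).setdefault(var, val)

print(pivoted)
```

Each distinct VAR value becomes a "column" of its ID's row, so `pivoted["ID_1"]` maps VAR_1/VAR_2/VAR_3 to Butter/Toast/Ham, matching the pivoted DataFrame above.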

This could be a valid answer, so why the downvote? – eliasah