2017-05-19 51 views
0

How to create columns from rows and then fill the subsequent column values in PySpark

I am working with PySpark and have the following DataFrame:

+---------+----+--------------------+-------------------+
|       id| sid|              values|              ratio|
+---------+----+--------------------+-------------------+
|  6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...|  0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...|           1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...|  0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+

What I want is to make sid 4178 a column and use the rounded ratio as its row value. The result should look like this:

+---------+------+------+------+
|       id|  4178|  4385|  4390|  (if the id has that sid, fill the cell with its ratio)
+---------+------+------+------+
|  6052791|  0.32|     0|     0|  (otherwise fill with 0)
+---------+------+------+------+

id   4178 
6052791  0.32 

The number of columns is the number of sids that have the same rounded ratio.

If a sid does not exist for an id, that sid column must contain 0 for that id.
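To make the requested shape concrete, here is a minimal sketch in plain Python (no Spark) of the pivot-with-zero-fill described above. The function name `pivot_with_zero_fill` is hypothetical, the three sample rows are copied from the table in the question, and rounding is deliberately left out because the question does not say how the ratio should be rounded.

```python
# Sample (id, sid, ratio) rows copied from the DataFrame shown in the question
rows = [
    (6052791, 4178, 0.32673267326732675),
    (139428308, 4385, 1.140625),
    (183739386, 4390, 0.32080419580419584),
]


def pivot_with_zero_fill(rows):
    """Return {id: {sid: ratio}}, with 0 wherever an id has no row for a sid.

    Hypothetical helper for illustration; if an (id, sid) pair occurs more
    than once, the last ratio seen wins.
    """
    # Every distinct sid becomes one output column
    sids = sorted({sid for _, sid, _ in rows})
    table = {}
    for id_, sid, ratio in rows:
        # Start each id with zeros in every sid column, then fill in the ratio
        table.setdefault(id_, {s: 0 for s in sids})[sid] = ratio
    return table


result = pivot_with_zero_fill(rows)
for id_, cells in result.items():
    print(id_, cells)
```

Each id ends up with one cell per distinct sid, which is the layout a Spark pivot followed by a null-to-zero fill would be expected to produce.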

+1

How many columns should there be in the final output? The unique sids? –

+0

Which sids belong to the same "group" as 4178? What is special about sids 4385 and 4390? Is the grouping based on the rounded ratio? –

+0

OK, let's just say these are helpers for the primary-key id that group the data together –

Answers

1

You need a column to group by; for this I add a new column named sNo.

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    // Sample (id, sid, ratio) rows
    val df = sc.parallelize(List(
      (6052791, 4178, 0.42673267326732675),
      (6052791, 4178, 0.22673267326732675),
      (6052791, 4179, 0.62673267326732675),
      (6052791, 4180, 0.72673267326732675),
      (6052791, 4179, 0.82673267326732675),
      (6052791, 4179, 0.92673267326732675)
    )).toDF("id", "sid", "ratio")

    // Add a constant column so that all rows fall into one group,
    // then turn each distinct sid into a column holding its minimum ratio
    df.withColumn("sNo", lit(1))
      .groupBy("sNo")
      .pivot("sid")
      .agg(min("ratio"))
      .show

This returns the following output:

+---+-------------------+------------------+------------------+
|sNo|               4178|              4179|              4180|
+---+-------------------+------------------+------------------+
|  1|0.22673267326732674|0.6267326732673267|0.7267326732673267|
+---+-------------------+------------------+------------------+
+0

You don't have to define this artificial 'sNo', because you can call 'groupBy' without any columns to express that all rows are in a single group. –

1

This sounds like a pivot, which in Spark SQL (Scala version) could look as follows:

    scala> ratios.
             groupBy("id").
             pivot("sid").
             agg(first("ratio")).
             show
    +-------+-------------------+
    |     id|               4178|
    +-------+-------------------+
    |6052791|0.32673267326732675|
    +-------+-------------------+

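I am still not sure how the other columns (4385 and 4390 in your example) are selected. It seems that you round ratio and search for other sids that would match.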