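How can I create columns from rows and then fill in the subsequent column values in PySpark? I am working with PySpark and have the following DataFrame: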
+---------+----+--------------------+-------------------+
| id| sid| values| ratio|
+---------+----+--------------------+-------------------+
| 6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...| 0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...| 1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...| 0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+
What I want is to turn sid 4178 into a column and use the rounded ratio as its row value. The result should look like this:
+---------+-------+------+-------+
| id| 4178 |4385 | 4390 |(if sid for id fill row with ratio)
+---------+-------+------+-------+
| 6052791|0.32 | 0 | 0 |(if not fill with 0)
id 4178
6052791 0.32
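The number of columns is the number of sids that have the same rounded ratio.
If a sid does not exist for a given id, then that sid column must contain 0 for that id.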
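How many columns should there be in the final output? One per unique sid?
Which sids belong to the same "group" as 4178? What is special about sid 4385 and 4390? Is the grouping done by the rounded ratio?
Okay, let's just say these are helpers for the primary key id that group the data together.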
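One way to read the question is as a plain pivot: one row per id, one column per distinct sid, the ratio as the cell value, and 0 wherever an id has no row for that sid. The sketch below is only an illustration under that reading, not a confirmed answer. `pivot_rows`, `truncate2` and `pivot_with_spark` are hypothetical helper names. The expected output shows 0.32 for 0.3267..., which looks like truncation rather than rounding, so the sketch truncates to two decimals; swapping in `round` is a one-line change. The PySpark variant assumes each (id, sid) pair has a single ratio and is defined here but not executed.

```python
import math
from collections import defaultdict


def truncate2(x):
    """Truncate a non-negative ratio to two decimals (0.3267... -> 0.32)."""
    return math.floor(x * 100) / 100


def pivot_rows(rows):
    """Plain-Python reference; rows are (id, sid, ratio) tuples.

    Returns (sids, table), where table maps id -> {sid: truncated ratio}
    and holds 0 for every sid that the id does not have.
    """
    sids = sorted({sid for _, sid, _ in rows})
    table = defaultdict(lambda: dict.fromkeys(sids, 0))
    for id_, sid, ratio in rows:
        table[id_][sid] = truncate2(ratio)
    return sids, dict(table)


def pivot_with_spark(df):
    """Assumed PySpark equivalent for a DataFrame with id, sid, ratio."""
    from pyspark.sql import functions as F  # imported lazily on purpose

    truncated = F.floor(F.col("ratio") * 100) / 100
    return (
        df.groupBy("id")
        .pivot("sid")             # one output column per distinct sid
        .agg(F.first(truncated))  # assumes one ratio per (id, sid) pair
        .fillna(0.0)              # ids without that sid get 0
    )


# Three rows taken from the table in the question.
rows = [
    (6052791, 4178, 0.32673267326732675),
    (139428308, 4385, 1.140625),
    (183739386, 4390, 0.32080419580419584),
]
sids, table = pivot_rows(rows)
print(sids)            # [4178, 4385, 4390]
print(table[6052791])  # {4178: 0.32, 4385: 0, 4390: 0}
```

If only a subset of sids should become columns (the "group" asked about in the comments), `pivot` also accepts an explicit list of values as its second argument, for example `pivot("sid", [4178, 4385, 4390])`, which additionally saves Spark a pass over the data to discover the distinct sids.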