2017-05-14

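Calculating percentages on a PySpark DataFrame

I have a PySpark DataFrame built from a large dataset; a copy is pasted below. How can I add a column containing each bucket's percentage of the total?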

[Image of the DataFrame; not preserved]

Thanks for your help!

What constitutes a bucket in this case? –

I want 233/sum(count), 314/sum(count), and so on. – Balla13

Answers

Something like the following should work.

# Rebuild the sample data (assumes an existing SparkContext `sc` and an active SparkSession)
rows = [(1, 'female', 233), (None, 'female', 314), (0, 'female', 81), (1, None, 342), (1, 'male', 109)]
df = sc.parallelize(rows).toDF(["survived", "sex", "count"])
# Sum the count column on the driver, then express each count as a percentage of that total
total = df.agg({"count": "sum"}).collect()[0]["sum(count)"]
result = df.withColumn("percent", df["count"] / total * 100)
result.show()

+--------+------+-----+------------------+
|survived|   sex|count|           percent|
+--------+------+-----+------------------+
|       1|female|  233| 21.59406858202039|
|    null|female|  314|29.101019462465246|
|       0|female|   81| 7.506950880444857|
|       1|  null|  342| 31.69601482854495|
|       1|  male|  109|10.101946246524559|
+--------+------+-----+------------------+

You need to:
- compute the total,
- create a UDF that calculates the percentage,
- and add the result as a new column.
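
This answer gives no code, so here is a minimal sketch of how those three steps might look. The names `percent_of` and `add_percent_column` are hypothetical helpers introduced only for illustration, and the column name `count` is assumed from the question's data; treat it as one possible reading of the steps rather than the answerer's own implementation.

```python
def percent_of(count, total):
    """Return count as a percentage of total, or None when it cannot be computed."""
    if count is None or not total:
        return None
    return 100.0 * count / total


def add_percent_column(df, count_col="count", out_col="percent"):
    """Append out_col holding each row's share of the column total (requires pyspark)."""
    # Imported here so that percent_of stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Step 1: compute the total on the driver.
    total = df.agg(F.sum(count_col)).collect()[0][0]
    # Step 2: wrap the plain function in a UDF; `total` is captured by the closure.
    percent_udf = F.udf(lambda c: percent_of(c, total), DoubleType())
    # Step 3: add the result as a new column.
    return df.withColumn(out_col, percent_udf(F.col(count_col)))
```

With a DataFrame `df` that has a numeric `count` column, `add_percent_column(df).show()` would display the extra column. A Python UDF sends every row through the Python interpreter, so plain column arithmetic such as `df["count"] / total * 100` is usually the faster choice when the calculation is this simple.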