Spark DataFrame: how to add an index column

I want to add a column numbered from 1 to the number of rows.

How should I do this? Thanks. (Scala)

2017-04-14 Liangpi

With Scala, you can use:

import org.apache.spark.sql.functions._ 

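// Note: monotonicallyIncreasingId is deprecated since Spark 2.0.0 in favour of monotonically_increasing_id()
df.withColumn("id", monotonicallyIncreasingId)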

You can refer to this example and the Scala docs.

With PySpark, you can use:

from pyspark.sql.functions import monotonically_increasing_id 

df_index = df.select("*").withColumn("id", monotonically_increasing_id())


2017-04-14 08:36:19 Omar14

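I wonder why the code you wrote for Scala doesn't work in PySpark, i.e. 'df.withColumn("id", monotonicallyIncreasingId)' – anwartheravian

The Scala code works, thanks. However, I get the following warning: "warning: there was one deprecation warning; re-run with -deprecation for details" – Ajay

monotonicallyIncreasingId does not guarantee that the "id" will run "from 1 to the number of rows". From the docs: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#monotonically_increasing_id-- "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive." – Gevorg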

monotonically_increasing_id: the generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive.
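To see why the IDs are unique and increasing but not consecutive, here is a minimal plain-Python sketch of the layout the Spark docs describe for the current implementation: the partition ID goes in the upper 31 bits and the record number within each partition in the lower 33 bits. No Spark is involved; `model_id` and the partition sizes are hypothetical names and values chosen for illustration only.

```python
# Plain-Python model of the ID layout documented for monotonically_increasing_id:
# partition ID in the upper 31 bits, per-partition record number in the lower 33 bits.
# model_id is a hypothetical helper, not part of the Spark API.

RECORD_BITS = 33  # the lower 33 bits hold the record number within a partition


def model_id(partition_id: int, record_number: int) -> int:
    """Combine a partition ID and a 0-based record number into a single ID."""
    return (partition_id << RECORD_BITS) | record_number


# Assumed example: two partitions holding 3 and 2 rows respectively.
partition_sizes = [3, 2]
ids = [
    model_id(p, r)
    for p, size in enumerate(partition_sizes)
    for r in range(size)
]
print(ids)  # [0, 1, 2, 8589934592, 8589934593]
```

The jump from 2 to 8589934592 (1 << 33) happens at the partition boundary, which is why the column cannot be relied on to run from 1 to the number of rows.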

"I want to add a column numbered from 1 to the number of rows."

Let's say we have the following DataFrame:

 
+--------+-------------+-------+
| userId | productCode | count |
+--------+-------------+-------+
|     25 |        6001 |     2 |
|     11 |        5001 |     8 |
|     23 |         123 |     5 |
+--------+-------------+-------+

To generate IDs starting from 1:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("count")
val result = df.withColumn("index", row_number().over(w))

This adds an index column, with the rows numbered in order of increasing count value.

 
+--------+-------------+-------+-------+
| userId | productCode | count | index |
+--------+-------------+-------+-------+
|     25 |        6001 |     2 |     1 |
|     23 |         123 |     5 |     2 |
|     11 |        5001 |     8 |     3 |
+--------+-------------+-------+-------+
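As a sanity check on the numbering, the effect of row_number() over a window ordered by count can be modelled in plain Python: sort the rows by count, then number them from 1. This is only a sketch of the semantics on the three sample rows, with no Spark involved; the names `rows`, `ordered` and `indexed` are made up for illustration.

```python
# Plain-Python model of row_number() over a window ordered by "count".
# No Spark is involved; the rows mirror the sample DataFrame from the answer.

rows = [
    {"userId": 25, "productCode": 6001, "count": 2},
    {"userId": 11, "productCode": 5001, "count": 8},
    {"userId": 23, "productCode": 123, "count": 5},
]

# Sort by the ordering column, then number the rows starting from 1.
ordered = sorted(rows, key=lambda row: row["count"])
indexed = [dict(row, index=i) for i, row in enumerate(ordered, start=1)]

for row in indexed:
    print(row["userId"], row["productCode"], row["count"], row["index"])
# 25 6001 2 1
# 23 123 5 2
# 11 5001 8 3
```

Unlike monotonically_increasing_id, this numbering is consecutive. One caveat worth keeping in mind: a window with an ordering but no partitioning makes Spark move all rows into a single partition, so the approach may be slow on large DataFrames.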


2017-10-14 02:56:09
