I created a DataFrame in PySpark like the one below, and I want to add a column to it containing the values 1 to n:
+----+------+
| k| v|
+----+------+
|key1|value1|
|key1|value1|
|key1|value1|
|key2|value1|
|key2|value1|
|key2|value1|
+----+------+
I want to add a "rowNum" column using the "withColumn" method, so the resulting DataFrame changes as follows:
+----+------+------+
| k| v|rowNum|
+----+------+------+
|key1|value1| 1|
|key1|value1| 2|
|key1|value1| 3|
|key2|value1| 4|
|key2|value1| 5|
|key2|value1| 6|
+----+------+------+
rowNum should range from 1 to n, where n is the number of rows in the original DataFrame. I modified my code like this:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy("v").orderBy('k')
my_df= my_df.withColumn("rowNum", F.rowNumber().over(w))
But I got the error message:
'module' object has no attribute 'rowNumber'
I replaced the rowNumber() method with row_number(), and the code above ran. But when I ran:
my_df.show()
I got another error message:
Py4JJavaError: An error occurred while calling o898.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
at org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate.doGenCode(interfaces.scala:342)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
at scala.Option.getOrElse(Option.scala:121)
This is most likely a dupe of [this question](http://stackoverflow.com/questions/32086578/how-to-add-row-id-in-pyspark-dataframes).