1
我需要爲ID字段生成自動遞增的值。我的方法是使用windows函數和max函數。Spark數據框將窗口函數的結果添加到常規函數中,如max。自動增量
我試圖找到純數據框解決方案(無rdd)。
所以我做right-outer join
後,我得到這個數據幀:
df2 = sqlContext.createDataFrame([(1,2), (3, None), (5, None)], ['someattr', 'id'])
# notice null values? it's a new records that don't have id just yet.
# The task is to generate them. Preferably with one query.
df2.show()
+--------+----+
|someattr| id|
+--------+----+
| 1| 2|
| 3|null|
| 5|null|
+--------+----+
我需要生成id
場自動遞增的值。我的方法是使用windows功能
df2.withColumn('id', when(df2.id.isNull(), row_number().over(Window.partitionBy('id').orderBy('id')) + max('id')).otherwise(df2.id))
當我做到這一點的以下異常加薪:
AnalysisException Traceback (most recent call last)
<ipython-input-102-b3221098e895> in <module>()
10
11
---> 12 df2.withColumn('hello', when(df2.id.isNull(), row_number().over(Window.partitionBy('id').orderBy('id')) + max('id')).otherwise(df2.id)).show()
/Users/ipolynets/workspace/spark-2.0.0/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
1371 """
1372 assert isinstance(col, Column), "col should be Column"
-> 1373 return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
1374
1375 @ignore_unicode_prefix
/Users/ipolynets/workspace/spark-2.0.0/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/Users/ipolynets/workspace/spark-2.0.0/python/pyspark/sql/utils.pyc in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"expression '`someattr`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"
不知道這是什麼例外抱怨說實話。
請注意我如何將window
函數添加到常規函數max()
?
row_number().over(Window.partitionBy('id').orderBy('id')) + max('id')
我不知道這是甚至不允許。
哦..這是預期的查詢輸出。正如你可能已經想到的那樣。
+--------+----+
|someattr| id|
+--------+----+
| 1| 2|
| 3| 3|
| 5| 4|
+--------+----+