Assigning a name to the output of a PySpark column agg

Say I have a DataFrame like this:
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
import datetime
sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)
rdd = sc.parallelize([('a',datetime.datetime(2014, 1, 9, 0, 0)),
('b',datetime.datetime(2014, 1, 27, 0, 0)),
('c',datetime.datetime(2014, 1, 31, 0, 0))])
testdf = sqlcontext.createDataFrame(rdd, ["id", "date"])
testdf.show()
testdf.printSchema()
which gives the test DataFrame:
+---+--------------------+
| id| date|
+---+--------------------+
| a|2014-01-09 00:00:...|
| b|2014-01-27 00:00:...|
| c|2014-01-31 00:00:...|
+---+--------------------+
root
|-- id: string (nullable = true)
|-- date: timestamp (nullable = true)
and I want to get the maximum of the date column:
max_date = testdf.agg(sf.max(sf.col('date'))).collect()
print(max_date)
which gives:
[Row(max(date)=datetime.datetime(2014, 1, 31, 0, 0))]
How can I apply a custom name, say max_date, in the original operation itself, instead of the automatically assigned max(date), so that I can access the value as max_date[0]['max_date'] rather than max_date[0][0] or max_date[0]['max(date)']? Also, is there a better way to access this value, perhaps some attribute of Row()?