2017-02-15

Hopefully this is fairly basic. I have a Spark dataframe containing a Date column, and I want to add a new column with the number of days elapsed since that date. Google-fu is failing me. Essentially: calculating the time between two dates in pyspark.

Here is what I have tried:

from pyspark.sql.types import * 
import datetime 
today = datetime.date.today() 

schema = StructType([StructField("foo", DateType(), True)]) 
l = [(datetime.date(2016,12,1),)] 
df = sqlContext.createDataFrame(l, schema) 
df = df.withColumn('daysBetween', today - df.foo) 
df.show() 

This fails with the error:

u"cannot resolve '(17212 - foo)' due to data type mismatch: '(17212 - foo)' requires (numeric or calendarinterval) type, not date;"
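A minimal sketch (plain Python, not Spark) of where the 17212 in that message appears to come from: Spark seems to have substituted the Python date literal with its internal representation, days since the Unix epoch, and a raw DateType column cannot be subtracted from that integer:

```python
import datetime

# 17212 in the error message looks like `today` (2017-02-15)
# expressed as days since the Unix epoch (1970-01-01).
epoch = datetime.date(1970, 1, 1)
today = datetime.date(2017, 2, 15)
print((today - epoch).days)  # 17212
```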

I have tried fiddling around with it but gotten nowhere. I don't think this should be too hard. Can anyone help?

Answers


OK, figured it out:

from pyspark.sql.types import * 
import pyspark.sql.functions as funcs 
import datetime 
today = datetime.date(2017,2,15) 

schema = StructType([StructField("foo", DateType(), True)]) 
l = [(datetime.date(2017,2,14),)] 
df = sqlContext.createDataFrame(l, schema) 
df = df.withColumn('daysBetween', funcs.datediff(funcs.lit(today), df.foo)) 
df.collect() 

This returns [Row(foo=datetime.date(2017, 2, 14), daysBetween=1)]
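As a plain-Python sanity check (not Spark code, just the same arithmetic), datediff(end, start) gives the number of days from start to end:

```python
import datetime

# Same day count that datediff produced for the example row.
today = datetime.date(2017, 2, 15)
foo = datetime.date(2017, 2, 14)
print((today - foo).days)  # 1
```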


You can simply do the following:

import pyspark.sql.functions as F 

df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo)) 
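Note the difference between the two answers: funcs.lit(today) freezes the reference date at the moment the dataframe is built, while F.current_date() is evaluated when the query runs. A plain-Python sketch of the current-date variant for a single value, with datetime.date.today() standing in for current_date():

```python
import datetime

# Day count for one value of foo, using today's date at run time.
foo = datetime.date(2016, 12, 1)
days_since = (datetime.date.today() - foo).days
print(days_since)
```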

For others' reference: the difference is in days. https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.datediff – gabra
