2017-07-28 41 views
2

I want to run a simple SQL query in the Spark shell over some DataFrames. The query adds an interval of 1 week to a certain date, as follows: Date and interval addition in SparkSQL

Original query:

scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show() 

Now, when I ran some tests:

scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show() 

I got the correct result:

+----------------------------------------------------------------------------+ 
|CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)| 
+----------------------------------------------------------------------------+ 
|                 1999-09-26| 
+----------------------------------------------------------------------------+ 

(it simply added 7 days: 19 + 7 = 26)

But when I merely changed the year to 1997 instead of 1999, the result changed!

scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show() 

+----------------------------------------------------------------------------+ 
|CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)| 
+----------------------------------------------------------------------------+ 
|                 1997-09-25| 
+----------------------------------------------------------------------------+ 

Why did the result change? Shouldn't it be 26 rather than 25?

So, is this a SparkSQL bug related to some kind of loss in an intermediate computation, or am I missing something?

Answers

6

This is probably an issue of conversion to local time. INTERVAL casts the data to TIMESTAMP and then back to DATE:

scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain 
== Physical Plan == 
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19] 
+- Scan OneRowRelation[] 

(note the second and third CASTs), and Spark is known to be inconsistent when handling timestamps.
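The failure mode described above can be reproduced outside Spark. The following is a minimal sketch using plain `java.time` (not Spark), under the assumption that the lost day comes from doing fixed-length interval arithmetic on the instant timeline and then truncating back to a date in a local timezone; the timezone and dates here are illustrative (the well-documented US fall-back transition on 2021-11-07 in America/New_York), not the asker's actual session timezone.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstDateShift {
    public static void main(String[] args) {
        // Midnight a few days before a DST "fall back" transition
        // (America/New_York left DST on 2021-11-07 at 02:00 local).
        ZonedDateTime start = LocalDate.of(2021, 11, 3)
                .atStartOfDay(ZoneId.of("America/New_York"));

        // Calendar arithmetic: "7 days later" keeps the local wall-clock
        // time, so midnight stays midnight.
        LocalDate calendarWeek = start.plusDays(7).toLocalDate();

        // Instant arithmetic: exactly 168 hours later on the timeline.
        // The repeated fall-back hour means we land at 23:00 local on the
        // *previous* calendar day, so truncating to a date loses a day.
        LocalDate instantWeek = start.plus(Duration.ofHours(168)).toLocalDate();

        System.out.println(calendarWeek); // 2021-11-10
        System.out.println(instantWeek);  // 2021-11-09
    }
}
```

The same shape of bug appears whenever a date is widened to a timestamp, shifted by a fixed-length interval, and narrowed back to a date across a timezone transition.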

DATE_ADD should exhibit more stable behavior:

scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain 
== Physical Plan == 
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27] 
+- Scan OneRowRelation[] 
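The reason DATE_ADD is stable is that it operates on the plain date as calendar arithmetic, without ever widening to a timestamp in a local timezone. A `java.time` analogue of that behavior (an illustration, not Spark's implementation):

```java
import java.time.LocalDate;

public class DateAddAnalogue {
    public static void main(String[] args) {
        // LocalDate has no timezone, so adding days is pure calendar
        // arithmetic and cannot be affected by DST transitions.
        LocalDate result = LocalDate.parse("1997-09-19").plusDays(7);
        System.out.println(result); // 1997-09-26
    }
}
```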
+2

Inconsistent indeed: if you have a cluster spanning two timezones, timestamp-to-date conversions fall apart completely (unless you use methods with an explicit timezone every time). –