Spark's int96 timestamp type

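When you create a timestamp column in Spark and save it to Parquet, you get a 12-byte integer column type (int96); I gather the data is split into 6 bytes for the Julian day and 6 bytes for the time within the day (in nanoseconds).

This does not conform to any Parquet logical type. The schema in the Parquet file therefore gives no indication that the column is anything other than an integer.

My question is: how does Spark know to load such a column as a timestamp rather than as a big integer?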

2017-03-06 mdurant

Actually it is 8 + 4 bytes, not 6 + 6. There is a pull request to document this type, see https://github.com/apache/parquet-format/pull/49. – Zoltan

Quite right, sorry. – mdurant
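To make the 8 + 4 layout from the comment concrete, here is a minimal sketch in plain Scala that decodes one raw INT96 value by hand. It assumes the convention used by Impala, Hive and Spark: the first 8 bytes hold the nanoseconds within the day, the last 4 bytes hold the Julian day number, both little-endian, and Julian day 2440588 corresponds to 1970-01-01. The object name `Int96Sketch` and its `encode` helper are hypothetical, written only for this illustration, and time-zone handling is left out (the value is treated as UTC).

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.time.Instant

// Hypothetical helper, not part of Spark or Parquet.
object Int96Sketch {
  // Julian day number of the Unix epoch, 1970-01-01.
  val JulianDayOfEpoch = 2440588L
  val SecondsPerDay = 86400L
  val NanosPerSecond = 1000000000L

  // Decode 12 raw bytes: 8 bytes nanoseconds-of-day, then 4 bytes Julian day.
  def decode(bytes: Array[Byte]): Instant = {
    require(bytes.length == 12, "an INT96 value is exactly 12 bytes")
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanosOfDay = buf.getLong()
    val julianDay = buf.getInt().toLong
    val epochSeconds = (julianDay - JulianDayOfEpoch) * SecondsPerDay
    // The second argument is a nanosecond adjustment added to the seconds.
    Instant.ofEpochSecond(epochSeconds, nanosOfDay)
  }

  // Inverse of decode, included only so the example can build its own input.
  def encode(ts: Instant): Array[Byte] = {
    val epochDay = java.lang.Math.floorDiv(ts.getEpochSecond, SecondsPerDay)
    val secondOfDay = java.lang.Math.floorMod(ts.getEpochSecond, SecondsPerDay)
    val nanosOfDay = secondOfDay * NanosPerSecond + ts.getNano
    ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
      .putLong(nanosOfDay)
      .putInt((epochDay + JulianDayOfEpoch).toInt)
      .array()
  }
}
```

For example, `Int96Sketch.decode(Int96Sketch.encode(Instant.parse("2017-03-06T10:00:00Z")))` should give back the same instant.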

The semantics are determined from metadata. We need a few imports:

import org.apache.parquet.hadoop.ParquetFileReader 
import org.apache.hadoop.fs.{FileSystem, Path} 
import org.apache.hadoop.conf.Configuration

Example data:

val path = "/tmp/ts" 

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts") 
    .withColumn("ts", $"ts".cast("timestamp")) 
    .write.mode("overwrite").parquet(path)

And the Hadoop configuration:

val conf = spark.sparkContext.hadoopConfiguration 
val fs = FileSystem.get(conf)

Now we can access the Spark metadata stored in the Parquet footer:

ParquetFileReader 
    .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path))) 
    .get(0) 
    .getParquetMetadata 
    .getFileMetaData 
    .getKeyValueMetaData 
    .get("org.apache.spark.sql.parquet.row.metadata")

And the result is:

String = {"type":"struct","fields":[
    {"name":"id","type":"integer","nullable":false,"metadata":{}}, 
    {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
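As a rough illustration of how that JSON string carries the type information, it can be parsed back into a Spark schema with `DataType.fromJson`. This is only a sketch of the idea, meant for a spark-shell session with Spark SQL on the classpath; it is not necessarily the exact code path Spark follows internally. The value names `json`, `schema` and `timestampColumns` are made up for the example.

```scala
import org.apache.spark.sql.types.{DataType, StructType, TimestampType}

// The schema string Spark stored in the footer, copied from the output above.
val json = """{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}"""

// Parse the JSON back into a Spark SQL schema.
val schema = DataType.fromJson(json).asInstanceOf[StructType]

// Names of the columns that the stored schema declares as timestamps.
val timestampColumns = schema.fields.filter(_.dataType == TimestampType).map(_.name)
```

With the schema recovered this way, the `ts` column is known to be a timestamp even though the Parquet physical type is only INT96.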

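Equivalent information can also be stored in the Metastore.

According to the official documentation, this is done for compatibility with Hive and Impala:

Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

It can be controlled with the spark.sql.parquet.int96AsTimestamp property.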
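A short sketch of setting the property at runtime, assuming a spark-shell session where `spark` is the active SparkSession and `/tmp/ts` is the example path written earlier in this answer. The value `true` is the documented default, so the call is shown only to make the setting explicit.

```scala
// Assumption: `spark` is the SparkSession provided by spark-shell.
// "true" is the documented default for this property.
spark.conf.set("spark.sql.parquet.int96AsTimestamp", "true")

// Read the example file back and inspect the schema Spark inferred.
val df = spark.read.parquet("/tmp/ts")
df.printSchema()
```

The same property can also be passed at launch with `--conf spark.sql.parquet.int96AsTimestamp=true`.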

2017-03-06 16:30:56 user6910411
