I am trying to use Parquet's min/max indexes. I have been following the question/answer here: Spark Parquet Statistics(min/max) integration. How can I see the min/max index in the Parquet metadata?
scala> val foo = spark.sql("select id, cast(id as string) text from range(1000)").sort("id")
scala> foo.printSchema
root
|-- id: long (nullable = false)
|-- text: string (nullable = false)
When I look at an individual Parquet file, I do not see any min/max statistics:
> parquet-tools meta part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
file: file:.../part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
id: REQUIRED INT64 R:0 D:0
text: REQUIRED BINARY O:UTF8 R:0 D:0
row group 1: RC:125 TS:1840 OFFSET:4
--------------------------------------------------------------------------------
id: INT64 GZIP DO:0 FPO:4 SZ:259/1044/4.03 VC:125 ENC:PLAIN,BIT_PACKED
text: BINARY GZIP DO:0 FPO:263 SZ:263/796/3.03 VC:125 ENC:PLAIN,BIT_PACKED
I have also tried .sortWithinPartitions("id"), with the same result.
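As a cross-check that does not depend on how parquet-tools formats its output, the footer can also be read directly from spark-shell. The following is only a minimal sketch, assuming the parquet-mr classes bundled with Spark are on the classpath; the file path is a hypothetical placeholder, and ParquetFileReader.readFooter is deprecated in newer parquet-mr releases, so it may need adjusting there.

```scala
// Run inside spark-shell, where `spark` is already defined.
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Hypothetical placeholder: replace with the path of one written part file.
val file = new Path("/path/to/part-00000.gz.parquet")
val conf = spark.sparkContext.hadoopConfiguration

// Read the complete footer (no row-group filtering).
val footer = ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER)

for {
  (block, i) <- footer.getBlocks.asScala.zipWithIndex
  column     <- block.getColumns.asScala
} {
  val stats = column.getStatistics
  // isEmpty is true when neither min/max nor a null count was recorded.
  val shown = if (stats == null || stats.isEmpty) "no statistics" else stats.toString
  println(s"row group ${i + 1}, column ${column.getPath.toDotString}: $shown")
}
```

If this prints statistics for a column, the values are present in the file and simply not shown by the command above; if it prints "no statistics", the writer did not record them for that column chunk.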
Did you find a solution? – RBanerjee
Statistics are not being generated with Spark 1.6 and parquet-mr 1.5. – RBanerjee