Unable to configure ORC properties in Spark

I am using Spark 1.6 (Cloudera 5.8.2) and tried the approach below to configure ORC properties, but it has no effect on the output.

Below is the code snippet I tried.

DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);
dataframe.write().format("orc").options(new HashMap<String, String>() {
    {
        put("orc.compress", "SNAPPY");
        put("hive.exec.orc.default.compress", "SNAPPY");

        put("orc.compress.size", "524288");
        put("hive.exec.orc.default.buffer.size", "524288");

        put("hive.exec.orc.compression.strategy", "COMPRESSION");
    }
}).save("spark_orc_output");

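Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.

Running hive --orcfiledump on the output confirms that the configuration was not applied. A snippet of the orcfiledump output is below.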

Compression: ZLIB 
Compression size: 262144

Source

2017-01-20 Vijay Kumar Reddy Chinnam

You are making two different mistakes here. I don't blame you; I've been there...

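Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties, which must be defined before the hiveContext object is created...

either in the hive-site.xml available to Spark at launch time

or in code, by re-creating the SparkContext...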

sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
sc.stop
val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
scAlt.getConf.get("orc.compress","<undefined>") // will now be snappy
// In Spark 1.6 the ORC data source needs Hive support, so build a HiveContext rather than a plain SQLContext
val hiveContextAlt = new org.apache.spark.sql.hive.HiveContext(scAlt)

[Edit] With Spark 2.x the script would become...
spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
spark.close
val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
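
For the hive-site.xml route mentioned under Issue #1, a fragment might look like the sketch below. The property names and values are simply the ones from the question. It assumes the file sits in the configuration directory Spark reads at startup. The asker reports that this route did not change the output, so treat it as an illustration of the syntax rather than a confirmed fix.

```xml
<!-- hive-site.xml fragment: illustrative only, values copied from the question -->
<configuration>
  <property>
    <name>hive.exec.orc.default.compress</name>
    <value>SNAPPY</value>
  </property>
  <property>
    <name>hive.exec.orc.default.buffer.size</name>
    <value>524288</value>
  </property>
</configuration>
```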

Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.

There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.

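For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...

You can set the following ORC-specific option(s) for writing ORC files:
• compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, zlib, and lzo). This will override orc.compress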

Note that the default compression codec changed with Spark 2; before that, it was zlib.

So the only thing you can set is the compression codec, using

dataframe.write().format("orc").option("compression","snappy").save("wtf")
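
Since the quoted JavaDoc lists only four short codec names, one option is to validate the name before starting a job. The helper below is a hypothetical sketch in plain Java; the class OrcCompressionOption and its normalize method are not part of Spark, and the accepted names are just those listed in the quote above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Spark API): checks a codec name against the quoted JavaDoc list. */
final class OrcCompressionOption {
    // Short names taken from the JavaDoc quoted in the answer.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("none", "snappy", "zlib", "lzo"));

    private OrcCompressionOption() {}

    /** Returns the lower-cased codec name, or throws if it is not one of the known names. */
    static String normalize(String codec) {
        if (codec == null) {
            throw new IllegalArgumentException("codec must not be null");
        }
        String name = codec.trim().toLowerCase(Locale.ROOT);
        if (!KNOWN.contains(name)) {
            throw new IllegalArgumentException("Unknown ORC compression codec: " + codec);
        }
        return name;
    }
}
```

The result could then be passed to the writer, for example .option("compression", OrcCompressionOption.normalize("SNAPPY")), so that a typo fails fast instead of surfacing only when the job runs.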

Source

2017-01-20 15:16:08

I tried the code below. It did not change the compression from zlib to snappy. SparkConf sc = new SparkConf(); sc.set("orc.compress","snappy"); sc.set("orc.compress.size","524288"); sc.setAppName("test_orc_config"); JavaSparkContext jsc = new JavaSparkContext(sc); ..... HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(jsc.sc()); DataFrame df = hiveContext.createDataFrame(rowData, schema); df.write().format("orc").option("compression","snappy").save("spark_orc_output");

I tried configuring hive-site.xml. It did not work.

So your last option is the simplest one: .option("compression","snappy")
