How do I define the Parquet schema for ParquetOutputFormat in a Hadoop job in Java?

I have a Hadoop job in Java that uses the sequence file output format:

job.setOutputFormatClass(SequenceFileOutputFormat.class);

I want to use the Parquet format instead. I tried to set it up the naive way:

job.setOutputFormatClass(ParquetOutputFormat.class); 
ParquetOutputFormat.setOutputPath(job, output); 
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP); 
ParquetOutputFormat.setCompressOutput(job, true);

But when it comes to writing the job's result to disk, the job fails:

Error: java.lang.NullPointerException: writeSupportClass should not be null 
    at parquet.Preconditions.checkNotNull(Preconditions.java:38) 
    at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:326)

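It seems that Parquet needs a schema to be set, but I could not find a manual or guide on how to do that in my case. My Reducer class tries to write 3 long values on each line, using org.apache.hadoop.io.LongWritable as the key and another type as the value.

How do I define a schema for this?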

2017-03-16 Viacheslav Shalamov

You have to specify a "parquet.hadoop.api.WriteSupport" implementation for your job (for example "parquet.proto.ProtoWriteSupport" for Protobuf or "parquet.avro.AvroWriteSupport" for Avro):

ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
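
For intuition on why this setter resolves the error from the question, here is a simplified model of the lookup. It is not Parquet's actual code: a plain Map stands in for the Hadoop Configuration, the class WriteSupportLookup and its method are made up for illustration, and the configuration key parquet.write.support.class is an assumption about where the class name is stored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Simplified stand-in for the write-support lookup; NOT Parquet's actual code.
class WriteSupportLookup {
    // Assumed configuration key under which the write-support class name is stored.
    static final String WRITE_SUPPORT_KEY = "parquet.write.support.class";

    // Returns the registered class name, or throws if nothing was registered.
    static String resolveWriteSupport(Map<String, String> conf) {
        String className = conf.get(WRITE_SUPPORT_KEY);
        // Mirrors the precondition seen in the stack trace: a missing entry raises an NPE.
        return Objects.requireNonNull(className, "writeSupportClass should not be null");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        try {
            resolveWriteSupport(conf); // nothing registered yet
        } catch (NullPointerException e) {
            System.out.println("Failed: " + e.getMessage());
        }
        // Registering a write-support class name makes the lookup succeed.
        conf.put(WRITE_SUPPORT_KEY, "parquet.proto.ProtoWriteSupport");
        System.out.println("Resolved: " + resolveWriteSupport(conf));
    }
}
```

In this model, setting only the output format class leaves the entry empty, which is the situation the stack trace describes.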

When using Protobuf, also specify the protobuf class:

ProtoParquetOutputFormat.setProtobufClass(job, YourProtobufClass.class);

And when using Avro, supply the schema like this:

// SCHEMA$ is the static schema field on Avro-generated classes
AvroParquetOutputFormat.setSchema(job, YourAvroClass.SCHEMA$);
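
For the three long values from the question, an Avro schema could look roughly like the sketch below. It only builds the schema JSON with the standard library. The record name LongTriple, the field names id, first and second, and the helper class ThreeLongSchema are all hypothetical. With Avro on the classpath, the resulting string can be parsed with new Schema.Parser().parse(json) and the result passed to AvroParquetOutputFormat.setSchema.

```java
// Hypothetical helper: builds an Avro record schema (as JSON) with one long field per name.
class ThreeLongSchema {
    static String avroJson(String recordName, String... fieldNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"record\",\"name\":\"").append(recordName).append("\",\"fields\":[");
        for (int i = 0; i < fieldNames.length; i++) {
            if (i > 0) {
                sb.append(','); // separate field entries
            }
            sb.append("{\"name\":\"").append(fieldNames[i]).append("\",\"type\":\"long\"}");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // Field names are placeholders; pick names that match your data.
        String json = avroJson("LongTriple", "id", "first", "second");
        System.out.println(json);
        // With Avro available, the string could then be parsed and registered:
        //   Schema schema = new org.apache.avro.Schema.Parser().parse(json);
        //   AvroParquetOutputFormat.setSchema(job, schema);
    }
}
```

With this route the reducer would most likely have to emit Avro records as its output value instead of the current Writable types.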

2017-06-14 10:19:35 Mohammad
