2014-12-08

In Spark Java, how do I convert a text file to a sequence file? Here is my code:

SparkConf sparkConf = new SparkConf().setAppName("txt2seq");
sparkConf.setMaster("local").set("spark.executor.memory", "1g");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt");
infile.saveAsNewAPIHadoopFile("outfile.seq", String.class, String.class, SequenceFileOutputFormat.class);

And I got the error below.

14/12/07 23:43:33 ERROR Executor: Exception in task ID 0 
java.io.IOException: Could not find a serializer for the Key class: 'java.lang.String'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization. 
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1176) 
    at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1091) 

Does anyone have any ideas? Thanks!

Answer


SequenceFileOutputFormat expects Hadoop Writable key/value classes, and java.lang.String is not a Writable. Change this:

JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt"); 
infile.saveAsNewAPIHadoopFile("outfile.seq", String.class, String.class, SequenceFileOutputFormat.class); 

to this:

// Needs org.apache.hadoop.io.Text, scala.Tuple2, and
// org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat imported.
JavaPairRDD<String, String> infile = ctx.wholeTextFiles("input_txt");
// Wrap each String in a Writable (Text) before saving as a sequence file.
JavaPairRDD<Text, Text> resultRDD = infile.mapToPair(f -> new Tuple2<>(new Text(f._1()), new Text(f._2())));
resultRDD.saveAsNewAPIHadoopFile("outfile.seq", Text.class, Text.class, SequenceFileOutputFormat.class);
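
For completeness, here is a minimal sketch (assumed, not from the original post) of reading the sequence file back with `JavaSparkContext.sequenceFile`. It assumes spark-core and hadoop-client are on the classpath; the `ReadSeq` class name is illustrative, and the path mirrors the one above:

```java
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReadSeq {
    public static void main(String[] args) {
        // Local context for a quick round-trip check.
        SparkConf conf = new SparkConf().setAppName("readSeq").setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(conf);

        // Keys and values come back as Text; copy them to String, because
        // Hadoop reuses the same Text objects across records.
        JavaPairRDD<Text, Text> seq =
            ctx.sequenceFile("outfile.seq", Text.class, Text.class);
        JavaPairRDD<String, String> strings =
            seq.mapToPair(t -> new Tuple2<>(t._1().toString(), t._2().toString()));

        strings.foreach(p -> System.out.println(p._1()));
        ctx.stop();
    }
}
```

Converting `Text` back to `String` eagerly (rather than keeping the `Text` objects) avoids subtle bugs from Hadoop's object reuse when the RDD is cached or collected.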