
I am trying to use Spark to read a sequence file generated by Hive. When I try to access the file, the job fails with org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException.

I have already tried the usual workaround of making the class serializable, but I am still facing this problem. I am posting the code snippet here; please let me know what I am missing.

Is the BytesWritable data type causing the problem, or is it something else?

JavaPairRDD<BytesWritable, Text> fileRDD = javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
List<String> result = fileRDD.map(new Function<Tuple2<BytesWritable, Text>, String>() {
    public String call(Tuple2<BytesWritable, Text> row) {
        return row._2.toString() + "\n";
    }
}).collect();

Please post the stack trace of the error; it would also help if you could post the entire code. – code

Answers


Please find the whole stack trace below:

17/05/04 19:00:54 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Task not serializable 
    org.apache.spark.SparkException: Task not serializable 
      at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304) 
      at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) 
      at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) 
      at org.apache.spark.SparkContext.clean(SparkContext.scala:2078) 
      at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:331) 
      at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:330) 
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) 
      at org.apache.spark.rdd.RDD.map(RDD.scala:330) 
      at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:96) 
      at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:46) 
      at data_conversion.DataConversion.snapShotMigration(DataConversion.java:100) 
      at data_conversion.DataConversion.dataMigration(DataConversion.java:59) 
      at data_conversion.DataConversion.main(DataConversion.java:50) 
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
      at java.lang.reflect.Method.invoke(Method.java:497) 
      at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559) 
    Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext 
    Serialization stack: 
      - object not serializable (class: org.apache.spark.api.java.JavaSparkContext, value: org.apache.spark.api.java.JavaSparkContext@3cf135c2) 
      - field (class: data_conversion.DataConversion, name: jsCtx, type: class org.apache.spark.api.java.JavaSparkContext) 
      - object (class data_conversion.DataConversion, data_conversion.DataConversion@...) 
      - field (class: data_conversion.DataConversion$1, name: this$0, type: class data_conversion.DataConversion) 
      - object (class data_conversion.DataConversion$1, data_conversion.DataConversion$1@...) 
      - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function) 
      - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>) 
      at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) 
      at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) 
      at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) 
      at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301) 
      ... 19 more 
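
The serialization stack above points at the real culprit: the anonymous Function (data_conversion.DataConversion$1) carries a hidden this$0 reference to the enclosing DataConversion instance, and that instance holds the non-serializable JavaSparkContext in its jsCtx field. A minimal sketch of one way to break that chain, assuming only the class and field names shown in the trace (everything else here is illustrative), is to move the mapping logic into a static nested class so it captures nothing from the driver-side object:

import java.util.List;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class DataConversion {

    // Static nested class: it has no implicit this$0 reference to DataConversion,
    // so serializing the function does not drag the JavaSparkContext along with it.
    static class RowToString implements Function<Tuple2<BytesWritable, Text>, String> {
        public String call(Tuple2<BytesWritable, Text> row) {
            return row._2.toString() + "\n";
        }
    }

    private final JavaSparkContext jsCtx; // stays on the driver, never serialized

    public DataConversion(JavaSparkContext jsCtx) {
        this.jsCtx = jsCtx;
    }

    public List<String> snapShotMigration(String path) {
        JavaPairRDD<BytesWritable, Text> fileRDD =
                jsCtx.sequenceFile(path, BytesWritable.class, Text.class);
        return fileRDD.map(new RowToString()).collect();
    }
}

The key point is that nothing referenced from inside call() pulls in the JavaSparkContext; a separate top-level class implementing Function would work the same way.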

Here is what is needed to make it work.

Because we use HBase to store our data, and this reducer writes its results to an HBase table, Hadoop tells us that it does not know how to serialize our data. That is why we need to help it: set the io.serializations variable in the setup. In Spark you can do the equivalent like this:

conf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});
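
For context, a minimal sketch of where such a line usually sits, assuming a job Configuration derived from an HBase configuration (the names HBaseSerializationConfig, conf, and hbaseConf are placeholders, not from the original post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.MutationSerialization;
import org.apache.hadoop.hbase.mapreduce.ResultSerialization;

public class HBaseSerializationConfig {

    // Builds a job configuration that also registers the HBase serializers,
    // so Hadoop knows how to serialize Mutation (Put/Delete) and Result objects.
    public static Configuration withHBaseSerializations() {
        Configuration hbaseConf = HBaseConfiguration.create();
        Configuration conf = new Configuration(hbaseConf);
        conf.setStrings("io.serializations",
                hbaseConf.get("io.serializations"),
                MutationSerialization.class.getName(),
                ResultSerialization.class.getName());
        return conf;
    }
}

This keeps whatever serializers were already configured and appends the HBase-specific ones instead of overwriting them.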