2016-10-27 51 views
2

我有一個用Scala編寫的Spark程序,它從HDFS讀取一個CSV文件,計算一個新列並將其保存爲一個parquet文件。我正在YARN集羣中運行程序。但是,每次我嘗試啓動它時,執行者都會在某個時刻出現此錯誤。在YARN模式下的Spark作業失敗

你能幫我找到可能導致這個錯誤的原因嗎?

從執行日誌

16/10/27 15:58:10 WARN storage.BlockManager: Putting block rdd_12_225 failed due to an exception 
16/10/27 15:58:10 WARN storage.BlockManager: Block rdd_12_225 could not be removed as it was not found on disk or in memory 
16/10/27 15:58:10 ERROR executor.Executor: Exception in task 225.0 in stage 4.0 (TID 465) 
java.io.IOException: Stream is corrupted 
    at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211) 
    at org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
    at java.io.DataInputStream.readInt(DataInputStream.java:387) 
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.readSize(UnsafeRowSerializer.scala:113) 
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.<init>(UnsafeRowSerializer.scala:120) 
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3.asKeyValueIterator(UnsafeRowSerializer.scala:110) 
    at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:66) 
    at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:62) 
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) 
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) 
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) 
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryRelation.scala:118) 
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryRelation.scala:110) 
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214) 
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) 
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) 
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) 
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) 
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) 
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) 
    at org.apache.spark.scheduler.Task.run(Task.scala:86) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 15385 of input buffer 
    at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39) 
    at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205) 
    ... 41 more 

編輯:

有代碼中使用

var df = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls", "true").csv(hdfsFileURLIn).repartition(nPartitions) 
df.printSchema() 
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK) 
df.repartition(nPartitions, $"ipix").write.mode("overwrite").option("spark.hadoop.dfs.replication", 1).parquet(hdfsFileURLOut) 

用戶功能A2P只是以兩個雙和返回等雙

我需要說的是,這對於較小的CSV(〜1Go)效果很好,但是是錯誤的每臺次大的人(〜15Go)發生

編輯2: 繼建議我禁用了重新分區,我用StorageLevel.DISK_ONLY

有了這個,我沒有得到會將塊RDD _ ** ***失敗,原因是一個例外,但仍然存在與LZ4異常(流已損壞):

16/10/28 07:53:00 ERROR util.Utils: Aborting task 
java.io.IOException: Stream is corrupted 
    at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:211) 
    at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:353) 
    at java.io.DataInputStream.read(DataInputStream.java:149) 
    at org.spark_project.guava.io.ByteStreams.read(ByteStreams.java:899) 
    at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:733) 
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) 
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) 
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) 
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) 
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:254) 
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) 
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) 
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345) 
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) 
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) 
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) 
    at org.apache.spark.scheduler.Task.run(Task.scala:86) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 12966 of input buffer 
    at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39) 
    at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205) 
    ... 25 more 

編輯3:我設法還去除第二再分配啓動它沒有任何錯誤(一使用列ipix重新分區)我會看看這個方法的文檔

編輯4:這是奇怪的,偶爾有些執行人失敗分段故障:

# 
# A fatal error has been detected by the Java Runtime Environment: 
# 
# SIGSEGV (0xb) at pc=0x00007f48d8a47f2c, pid=3501, tid=0x00007f48cc60c700 
# 
# JRE version: Java(TM) SE Runtime Environment (8.0_102-b14) (build 1.8.0_102-b14) 
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.102-b14 mixed mode linux-amd64 compressed oops) 
# Problematic frame: 
# J 4713 C2 org.apache.spark.unsafe.types.UTF8String.hashCode()I (18 bytes) @ 0x00007f48d8a47f2c [0x00007f48d8a47e60+0xcc] 
# 
# Core dump written. Default location: /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1477580152295_0008/container_1477580152295_0008_01_000006/core or core.3501 
# 
# An error report file with more information is saved as: 
# /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1477580152295_0008/container_1477580152295_0008_01_000006/hs_err_pid3501.log 
# 
# If you would like to submit a bug report, please visit: 
# http://bugreport.java.com/bugreport/crash.jsp 
# 

我檢查了內存和我所有的執行者總是有足夠的空閒內存(至少6Go)

編輯4:所以我測試了多個文件,並執行總是成功,但有時執行人失敗(與上述錯誤),並再次啓動YARN

+0

加入您的代碼,瞭解更多.. – Shankar

+0

@Shankar done。 –

+1

你嘗試沒有重新分配?只是一個猜測.. – Shankar

回答

0

你正在使用哪個版本的lz4-java?這可能與版本1.1.2中已修復的問題有關 - 請參閱此bug report

另外,我對您的函數a2p很好奇。理想情況下,應該將兩個Column對象作爲輸入,而不僅僅是雙精度(除非將其註冊爲UDF)。

+0

是的,我用udf註冊了它。編輯:我怎麼知道LZ4的版本? –

+0

您應該能夠找到它,例如,在jar清單文件或您的項目的依賴 – ShirishT

+0

我有1.3。0 –

0

進入相同的問題。

症狀看起來完全是這樣的problem: SPARK-18105

截止到2017年1月29日,它尚未確定。

相關問題