我試圖解決我在我的火花設置中看到的內存溢出問題,此時,我無法就我爲什麼看到這一點做出具體分析。編寫數據框到鑲木地板或卡夫卡時,我總是看到這個問題。我的數據幀有5000行。它的模式是寫入實木複合地板/卡夫卡:線程中的異常「dag-scheduler-event-loop」java.lang.OutOfMemoryError
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: array (nullable = true)
| |-- element: string (containsNull = true)
|-- E: array (nullable = true)
| |-- element: string (containsNull = true)
|-- F: double (nullable = true)
|-- G: array (nullable = true)
| |-- element: double (containsNull = true)
|-- H: integer (nullable = true)
|-- I: double (nullable = true)
|-- J: double (nullable = true)
|-- K: array (nullable = true)
| |-- element: double (containsNull = false)
其中列G可以具有高達16MB的單元大小。我的數據幀總大小約爲10GB,分爲12個分區。在編寫之前,我試圖用repartition()創建48個分區,但即使我沒有重新分區編寫,也會出現問題。在這個例外時,我只有一個Dataframe緩存,大小約爲10GB。我的驅動程序有19GB的可用內存,2個執行程序每個有8GB的可用內存。 spark版本是2.1.0.cloudera1和scala版本是2.11.8。
我有以下設置:
spark.driver.memory 35G
spark.executor.memory 25G
spark.executor.instances 2
spark.executor.cores 3
spark.driver.maxResultSize 30g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1g
spark.rdd.compress true
spark.rpc.message.maxSize 2046
spark.yarn.executor.memoryOverhead 4096
例外回溯是
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:991)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:765)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:764)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:764)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1228)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
任何見解?
有類似的問題。增加Java堆大小解決了它。請參閱https://stackoverflow.com/q/1565388/5039312 – Marco