I am trying to resolve an out-of-memory error I am seeing in my Spark setup, and at this point I cannot pin down a specific cause. The error always occurs when writing a dataframe to Parquet or Kafka: Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError. The dataframe has 5000 rows, and its schema is:

root 

    |-- A: string (nullable = true) 
    |-- B: string (nullable = true) 
    |-- C: string (nullable = true) 
    |-- D: array (nullable = true) 
    | |-- element: string (containsNull = true) 
    |-- E: array (nullable = true) 
    | |-- element: string (containsNull = true) 
    |-- F: double (nullable = true) 
    |-- G: array (nullable = true) 
    | |-- element: double (containsNull = true) 
    |-- H: integer (nullable = true) 
    |-- I: double (nullable = true) 
    |-- J: double (nullable = true) 
    |-- K: array (nullable = true) 
    | |-- element: double (containsNull = false) 

Column G can hold cells of up to 16MB. The total size of the dataframe is about 10GB, split across 12 partitions. Before writing, I tried to create 48 partitions with repartition(), but the problem appears even if I write without repartitioning. At the time of the exception I have only one dataframe cached, of roughly 10GB. My driver has 19GB of free memory and the 2 executors have 8GB of free memory each. The Spark version is 2.1.0.cloudera1 and the Scala version is 2.11.8.
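For reference, a minimal sketch of the Parquet write path described above (the writeParquet helper, the df variable and the output path are illustrative placeholders, not code from the original setup):

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // `df` stands in for the ~10GB, 5000-row dataframe described above;
    // the output path is a placeholder, not from the original post.
    def writeParquet(df: DataFrame): Unit = {
      df.repartition(48)                      // the repartition attempted before writing
        .write
        .mode(SaveMode.Overwrite)
        .parquet("hdfs:///tmp/out.parquet")
    }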

I have the following settings:

spark.driver.memory  35G 
spark.executor.memory 25G 
spark.executor.instances 2 
spark.executor.cores 3 
spark.driver.maxResultSize  30g 
spark.serializer  org.apache.spark.serializer.KryoSerializer 
spark.kryoserializer.buffer.max 1g 
spark.rdd.compress  true 
spark.rpc.message.maxSize  2046 
spark.yarn.executor.memoryOverhead  4096 
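For illustration, a sketch of the same settings applied through the SparkSession builder; in the actual setup they were presumably set in spark-defaults.conf or on the spark-submit command line, and some of them (e.g. spark.driver.memory) only take effect when set before the driver JVM starts:

    import org.apache.spark.sql.SparkSession

    // The posted configuration, expressed programmatically (app name is a placeholder).
    val spark = SparkSession.builder()
      .appName("oom-repro")
      .config("spark.driver.memory", "35g")   // no effect if set after the driver JVM has started
      .config("spark.executor.memory", "25g")
      .config("spark.executor.instances", "2")
      .config("spark.executor.cores", "3")
      .config("spark.driver.maxResultSize", "30g")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.kryoserializer.buffer.max", "1g")
      .config("spark.rdd.compress", "true")
      .config("spark.rpc.message.maxSize", "2046")
      .config("spark.yarn.executor.memoryOverhead", "4096")
      .getOrCreate()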

The exception traceback is:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError 
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) 
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) 
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) 
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) 
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) 
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) 
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) 
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) 
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) 
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) 
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:991) 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:765) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:764) 
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) 
    at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:764) 
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1228) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 

Any insights?

Had a similar issue. Increasing the Java heap size solved it. See https://stackoverflow.com/q/1565388/5039312 – Marco

Answer

We finally found the problem. We were running k-fold logistic regression in Scala with k = 4 on the 5000-row dataframe. After classification we ended up with 4 test-output dataframes of 1250 rows each, each partitioned into at least 200 partitions, so we had more than 800 partitions over only 5000 rows of data. The code then repartitioned this combined dataframe to 48 partitions. Our system apparently could not handle that repartitioning, most likely because of the shuffle it triggers. To fix it, we repartitioned each fold's output dataframe to a smaller number (instead of repartitioning the combined dataframe), and that resolved the issue.
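A sketch of the kind of change described, assuming the fold outputs are collected in a sequence before being combined (the helper name, fold count and per-fold partition number are illustrative, not the original code):

    import org.apache.spark.sql.DataFrame

    // Before: each of the 4 fold outputs kept its ~200+ partitions and the
    // combined ~800-partition dataframe was repartitioned to 48 in one step.
    // After: shrink each fold output individually before unioning, so no
    // large repartition is needed on the combined dataframe.
    def combineFoldOutputs(foldOutputs: Seq[DataFrame]): DataFrame = {
      foldOutputs
        .map(_.repartition(12))   // illustrative per-fold target; each fold is only ~1250 rows
        .reduce(_ union _)        // combined result ends up with 4 * 12 = 48 partitions
    }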