
Spark throws an OutOfMemoryError when I run my Spark Python code as follows:

import pyspark

# Local Spark context with 512 MB of executor memory
conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)  # start the context with this conf

data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
# take() pulls the results back to the driver as a single in-memory list
data = data.flatMap(lambda x: x.split()).take(all)

The file is about 20 GB and my machine has 8 GB of RAM. When I run the program in standalone mode, it throws an OutOfMemoryError:

Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space 
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119) 
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112) 
    at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
    at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112) 
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
    at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112) 
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
    at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112) 
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
    at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112) 
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259) 
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259) 
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884) 
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884) 
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:681) 
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:666) 

Can't Spark handle a file larger than my memory? Could you tell me how to fix this?


http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space/22742982#22742982 – samthebest

Answer


Spark can handle this in some cases. But you are using take to force Spark to fetch all of the data into an array (in memory). In that case, you should store the results to files instead, e.g. with saveAsTextFile.

If you only want to look at part of the data, you can use sample or takeSample.
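
Below is a minimal sketch of this suggestion applied to the question's code; the output path is an assumption, chosen purely for illustration:

import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)

words = (sc.textFile('/Users/tsangbosco/Downloads/transactions')
           .flatMap(lambda x: x.split()))

# Write the full result back to disk instead of collecting it into driver memory
words.saveAsTextFile('/Users/tsangbosco/Downloads/transactions_words')

# To inspect only a small portion of the data on the driver:
preview = words.take(10)                         # first 10 elements
random_preview = words.takeSample(False, 10, 1)  # 10 random elements, without replacement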


If I want to train on all of the feature samples in the file with MLlib, how can I send all the samples for training? Thanks – BoscoTsang


Actually, MLlib has many algorithms. Which one do you want to use? – zsxwing


For example, I want to train the data with logistic regression. The file has several columns of data, and each column is a feature. How can I feed all the samples of these features into logistic regression? – BoscoTsang
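
For reference, a rough sketch of what that could look like with MLlib's LogisticRegressionWithSGD, assuming each line of the file holds a label followed by numeric feature columns (the actual file layout is not stated in the thread):

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

def parse_line(line):
    # Assumed layout: first column is the label, the remaining columns are features
    values = [float(v) for v in line.split()]
    return LabeledPoint(values[0], values[1:])

points = sc.textFile('/Users/tsangbosco/Downloads/transactions').map(parse_line)

# Train on the distributed RDD directly; no take()/collect() is needed
model = LogisticRegressionWithSGD.train(points)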