OutOfMemoryError while running logistic regression in SparkR

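I have successfully installed Apache Spark and Hadoop on Ubuntu 12.04 (single-node standalone mode) to run logistic regression. I tested it with a small CSV dataset and it worked, but it fails on a large dataset with 269369 rows.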

library(SparkR) 
sc <- sparkR.init() 
iterations <- as.integer(11) 
D <- 540 

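readPartition <- function(part) {
    # Split each CSV line into its fields
    part <- strsplit(part, ",", fixed = TRUE)
    # byrow = TRUE so that each input line becomes one row of the matrix
    list(matrix(as.numeric(unlist(part)), ncol = length(part[[1]]), byrow = TRUE))
}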
w <- runif(n=D, min = -1, max = 1) 

cat("Initial w: ", w, "\n") 

# Compute logistic regression gradient for a matrix of data points 
gradient <- function(partition) { 
    partition = partition[[1]] 
    Y <- partition[, 1] # point labels (first column of input file) 

    X <- partition[, -1] # point coordinates 
    # For each point (x, y), compute gradient function 
    #print(w) 
    dot <- X %*% w  
    logit <- 1/(1 + exp(-Y * dot)) 
    grad <- t(X) %*% ((logit - 1) * Y) 
    list(grad) 
} 


for (i in 1:iterations) { 
    cat("On iteration ", i, "\n") 
    w <- w - reduce(lapplyPartition(points, gradient), "+") 
} 

> points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/henry/cdata_mr.csv"), readPartition))

The error message I get with this data:

14/10/07 01:47:16 INFO FileInputFormat: Total input paths to process : 1 
14/10/07 01:47:28 WARN CacheManager: Not enough space to cache partition rdd_23_0 in memory! Free memory is 235841615 bytes. 
14/10/07 01:47:42 WARN CacheManager: Not enough space to cache partition rdd_23_1 in memory! Free memory is 236015334 bytes. 
14/10/07 01:47:55 WARN CacheManager: Not enough space to cache partition rdd_23_2 in memory! Free memory is 236015334 bytes. 
14/10/07 01:48:10 WARN CacheManager: Not enough space to cache partition rdd_23_3 in memory! Free memory is 236015334 bytes. 
14/10/07 01:48:29 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 17) 
java.lang.OutOfMemoryError: Java heap space 
    at edu.berkeley.cs.amplab.sparkr.RRDD$$anon$2.read(RRDD.scala:144) 
    at edu.berkeley.cs.amplab.sparkr.RRDD$$anon$2.<init>(RRDD.scala:156) 
    at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:129) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) 
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:227) 
    at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:120) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) 
    at org.apache.spark.scheduler.Task.run(Task.scala:54) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:701) 
14/10/07 01:48:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]

Dimensions of the data (sample):

data <- read.csv("/home/Henry/data.csv") 

dim(data) 

[1] 269369 541

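I have also tried hosting the same CSV file on the local filesystem as well as on HDFS. Does it need more Hadoop datanodes to store a large dataset? If so, how should I set up a Spark Hadoop cluster to get rid of this problem? (Or am I doing something wrong?)

Hint: I think increasing the Java and Spark heap space would let me run this. I have tried many things without success. Does anyone know how to increase the heap space for both?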

Source

2014-10-06 Hanry

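You could try setting spark.executor.memory to a larger value, as described in the Spark configuration documentation. As a back-of-the-envelope calculation, assuming each entry in the dataset takes 4 bytes, the whole file would take 269369 * 541 * 4 bytes ~= 560MB in memory, which exceeds the default value of 512m for that parameter.

As an example, you could try the following (assuming each worker node in the cluster has more than 1GB of memory available):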

sc <- sparkR.init("local[2]", "SparkR", "/home/spark", 
        list(spark.executor.memory="1g"))
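
The back-of-the-envelope estimate above can be reproduced in R. This is only a rough sketch: the variable names are illustrative, the dimensions come from the dim(data) output in the question, and the 8-byte line is an added assumption that the values are stored as R doubles (as as.numeric produces). Under that assumption the matrix alone would come to roughly 1112 MiB, so a value larger than 1g may be needed.

```r
# Dimensions reported by dim(data) in the question
n_rows <- 269369
n_cols <- 541

# Estimated in-memory size under two per-entry assumptions
bytes_4 <- n_rows * n_cols * 4   # 4 bytes per entry, as assumed in the answer
bytes_8 <- n_rows * n_cols * 8   # 8 bytes per entry, the size of an R double

# Report both estimates in MiB (1 MiB = 1024^2 bytes)
cat(sprintf("4 bytes/entry: %.0f MiB\n", bytes_4 / 1024^2))
cat(sprintf("8 bytes/entry: %.0f MiB\n", bytes_8 / 1024^2))
```

Both figures ignore any overhead from serialization or from the JVM and R processes themselves, so they should be read as lower bounds rather than exact requirements.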

Source

2014-10-12 16:47:23 Covi
