I have successfully installed Apache Spark and Hadoop on Ubuntu 12.04 (single-machine standalone mode) and can run logistic regression in SparkR. It works with a small test CSV dataset, but fails with an OutOfMemoryError on a larger dataset of 269,369 rows. This is the SparkR code I am running:
library(SparkR)

sc <- sparkR.init()
iterations <- as.integer(11)
D <- 540

# Turn one partition (a character vector of CSV lines) into a numeric matrix
readPartition <- function(part) {
  part <- strsplit(part, ",", fixed = TRUE)
  list(matrix(as.numeric(unlist(part)), ncol = length(part[[1]])))
}

w <- runif(n = D, min = -1, max = 1)
cat("Initial w: ", w, "\n")

# Compute logistic regression gradient for a matrix of data points
gradient <- function(partition) {
  partition <- partition[[1]]
  Y <- partition[, 1]   # point labels (first column of input file)
  X <- partition[, -1]  # point coordinates
  # For each point (x, y), compute gradient function
  # print(w)
  dot <- X %*% w
  logit <- 1 / (1 + exp(-Y * dot))
  grad <- t(X) %*% ((logit - 1) * Y)
  list(grad)
}

for (i in 1:iterations) {
  cat("On iteration ", i, "\n")
  w <- w - reduce(lapplyPartition(points, gradient), "+")
}
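To make clear what gradient is expected to return for one partition, here is a tiny plain-R check with made-up values (two points, label in column 1 and two features; no Spark involved):

# Toy check of the gradient step; the numbers are invented for illustration only
partition <- matrix(c( 1, 0.5, 0.2,
                      -1, 0.3, 0.9), ncol = 3, byrow = TRUE)
w_toy <- c(0.1, -0.2)
Y <- partition[, 1]
X <- partition[, -1]
logit <- 1 / (1 + exp(-Y * (X %*% w_toy)))
t(X) %*% ((logit - 1) * Y)   # per-partition contribution summed by reduce(..., "+")

The points RDD used in the loop is built from the CSV file on HDFS like this: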
> points <- cache(lapplyPartition(textFile(sc, "hdfs://localhost:54310/henry/cdata_mr.csv"), readPartition))
The error message it gives me:
14/10/07 01:47:16 INFO FileInputFormat: Total input paths to process : 1
14/10/07 01:47:28 WARN CacheManager: Not enough space to cache partition rdd_23_0 in memory! Free memory is 235841615 bytes.
14/10/07 01:47:42 WARN CacheManager: Not enough space to cache partition rdd_23_1 in memory! Free memory is 236015334 bytes.
14/10/07 01:47:55 WARN CacheManager: Not enough space to cache partition rdd_23_2 in memory! Free memory is 236015334 bytes.
14/10/07 01:48:10 WARN CacheManager: Not enough space to cache partition rdd_23_3 in memory! Free memory is 236015334 bytes.
14/10/07 01:48:29 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 17)
java.lang.OutOfMemoryError: Java heap space
at edu.berkeley.cs.amplab.sparkr.RRDD$$anon$2.read(RRDD.scala:144)
at edu.berkeley.cs.amplab.sparkr.RRDD$$anon$2.<init>(RRDD.scala:156)
at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:129)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at edu.berkeley.cs.amplab.sparkr.RRDD.compute(RRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
14/10/07 01:48:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
Dimensions of the data (sample):
data <- read.csv("/home/Henry/data.csv")
dim(data)
[1] 269369 541
I have tried hosting the same CSV file both on the local file system and on HDFS. I suspect it needs more Hadoop DataNodes to store a dataset this large? If so, how should I set up a Spark/Hadoop cluster to get around this problem? (Or am I doing something wrong?)
Hint: I thought increasing the Java and Spark heap space would let this run, and I have tried that in a number of ways without success. Does anyone know how to increase the heap space for both?
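For example, this is roughly the kind of configuration I have been experimenting with; I am assuming the sparkEnvir argument of sparkR.init is the right place for the spark.executor.memory and spark.driver.memory properties, but it has not solved the problem for me:

# Sketch of what I mean by increasing the heap space; the property names
# and values here are my assumption, not a confirmed fix.
library(SparkR)
sc <- sparkR.init(
  master = "local[2]",
  sparkEnvir = list(spark.executor.memory = "4g",   # executor (worker JVM) heap
                    spark.driver.memory   = "4g")   # driver JVM heap
)

(If the driver heap has to be set before its JVM starts, I am not sure whether passing it here is enough.)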