如何在Apache Spark Cluster模式下運行更多的執行程序

我有50名工人，我想在我的所有工人上運行我的工作。
在主：8080，我可以看到有所有工人，
在主機：4040 /執行人，我可以看到50個執行人，
但是當我跑我的工作，信息顯示是這樣的：
如何在Apache Spark Cluster模式下運行更多的執行程序

14/10/19 14:57:07 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 
14/10/19 14:57:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slave11, NODE_LOCAL, 1302 bytes) 
14/10/19 14:57:07 INFO nio.ConnectionManager: Accepted connection from [slave11/10.10.10.21:42648] 
14/10/19 14:57:07 INFO nio.SendingConnection: Initiating connection to [slave11/10.10.10.21:54398] 
14/10/19 14:57:07 INFO nio.SendingConnection: Connected to [slave11/10.10.10.21:54398], 1 messages pending 
14/10/19 14:57:07 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave11:54398 (size: 2.4 KB, free: 267.3 MB) 
14/10/19 14:57:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave11:54398 (size: 18.4 KB, free: 267.2 MB) 
14/10/19 14:57:12 INFO storage.BlockManagerInfo: Added rdd_2_0 in memory on slave11:54398 (size: 87.4 MB, free: 179.8 MB) 
14/10/19 14:57:12 INFO scheduler.DAGScheduler: Stage 0 (first at GeneralizedLinearAlgorithm.scala:141) finished in 5.473 s 
14/10/19 14:57:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 5463 ms on slave11 (1/1) 
14/10/19 14:57:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

而且我的工作像這樣的代碼：（命令行）

master: $ ./spark-shell --master spark://master:7077

這個（Scala代碼）：

import org.apache.spark.SparkContext 
import org.apache.spark.mllib.classification.SVMWithSGD 
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics 
import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.mllib.linalg.Vectors 
import org.apache.spark.mllib.util.MLUtils 

val fileName = "bc.txt" 
val data = sc.textFile(fileName) 

val splits = data.randomSplit(Array(0.9, 0.1), seed = 11L) 
val training = splits(0).cache() 
val test = splits(1) 

val training_1 = training.map { line => 
val parts = line.split(' ') 
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray)) 
} 

val test_1 = test.map { line => 
val parts = line.split(' ') 
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray)) 
} 
val numIterations = 200 

val model = SVMWithSGD.train(training_1, numIterations)

我的問題是，爲什麼只有一個或我的羣集在兩個（有時）任務運行？
是否有任何方式來配置任務的數量，或者是由調度程序自動調度？
當我的工作在兩個任務上運行時，它將與我在master上觀察到的兩個執行程序一起運行：4040，
它會給2倍加速，所以我想在所有執行程序上運行我的工作，我該怎麼做？

謝謝大家。

來源

2014-10-19 Chen-Lin Yeh

可以使用minPartitions參數textFile設定最小數量的任務，如：

val data = sc.textFile(fileName, 10)

然而，更多的分區通常意味着更多的網絡流量，因爲更多的分區，使星火很難調度任務的本地執行者運行。你需要自己找到一個餘額號碼minPartitions。

來源

2014-10-20 06:40:19 zsxwing

感謝您的回答，但我更改我的代碼爲「val data = sc.textFile（fileName，50）」就像您說的，在master：4040可以看到，RDD塊現在總計50個（50個分區），但仍然只有兩臺機器運行所有任務，我想安排50個分區到我的所有機器，如何設置它？再次感謝！ – 2014-10-20 09:29:36

[link] https://www.dropbox.com/s/5qqv1t0fgudzllt/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%202014-10-20%20 ％E4％B8％8B％E5％8D％885.33.22.jpg？dl = 0 – 2014-10-20 09:35:13

每個員工有多少個核心？如果它大於25，是的，它可能只派遣兩臺機器。 – zsxwing 2014-10-20 11:18:55

如何在Apache Spark Cluster模式下運行更多的執行程序

回答

相關問題