2013-06-21

Spark standalone mode: connection to 127.0.1.1:<PORT> refused

I am using Spark 0.7.2 in standalone mode, with 7 workers and 1 separate master, to process ~90 GB of log data (19 GB compressed) with the following driver program:

System.setProperty("spark.default.parallelism", "32")
val sc = new SparkContext("spark://10.111.1.30:7077", "MRTest",
                          System.getenv("SPARK_HOME"), Seq(System.getenv("NM_JAR_PATH")))
val logData = sc.textFile("hdfs://10.111.1.30:54310/logs/")
// Key each line by field 0, keep field 9, and join all values per key with " || ".
val dcxMap = logData.map(line => (line.split("\\|")(0),
                                  line.split("\\|")(9)))
                    .reduceByKey(_ + " || " + _)
dcxMap.saveAsTextFile("hdfs://10.111.1.30:54310/out")
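(As a side note, the map above splits each line twice. A hypothetical standalone sketch of the same pairing logic, assuming the `|`-delimited layout with at least 10 fields, that splits only once per line:)

```scala
object PairFields {
  // Split once per line, then key by field 0 and keep field 9,
  // mirroring the map step of the driver program above.
  def pair(line: String): (String, String) = {
    val fields = line.split("\\|")
    (fields(0), fields(9))
  }

  def main(args: Array[String]): Unit = {
    println(pair("a|1|2|3|4|5|6|7|8|payload"))  // (a,payload)
  }
}
```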

After all of stage 1's ShuffleMapTasks finish:

Stage 1 (reduceByKey at DcxMap.scala:31) finished in 111.312 s 

it submits stage 0:

Submitting Stage 0 (MappedRDD[6] at saveAsTextFile at DcxMap.scala:38), which is now runnable 

After some serialization, it prints:

spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host23 
spark.MapOutputTracker - Size of output statuses for shuffle 0 is 2008 bytes 
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host21 
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host22 
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host26 
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host24 
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host27 
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host28 

After this, nothing happens anymore, and top also shows that the workers are all idle now. If I look at the logs on the worker machines, the same thing happens on each of them:

13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:34288] 
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:36040] 
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:50467] 
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:60833] 
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:49893] 
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:39907] 

Then, for each of these "Initiating connection" attempts, it throws the same error on every worker (showing host27's log as an example, just the first occurrence of the error):

13/06/21 07:32:25 WARN network.SendingConnection: Error finishing connection to host27/127.0.1.1:49893 
java.net.ConnectException: Connection refused 
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) 
    at spark.network.SendingConnection.finishConnect(Connection.scala:221) 
    at spark.network.ConnectionManager.spark$network$ConnectionManager$$run(ConnectionManager.scala:127) 
    at spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:70) 

Why does this happen? The workers seem to be able to communicate with each other just fine; the only problem seems to occur when they try to send messages to themselves. In the example above, host27 tries to send 6 messages to itself and fails all 6 times. Sending messages to the other workers works fine. Does anyone have an idea?

EDIT: Maybe it has something to do with Spark using 127.0.1.1 instead of 127.0.0.1? /etc/hosts looks like this:

127.0.0.1  localhost 
127.0.1.1  host27.<ourdomain> host27 
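One way to see which address the worker JVM picks up from an /etc/hosts like this is to resolve the local hostname directly. A minimal check (the printed values depend on the machine it runs on, so no expected output is shown):

```scala
import java.net.InetAddress

object LocalAddrCheck {
  def main(args: Array[String]): Unit = {
    // On a Debian-style host with a "127.0.1.1 host27" entry in
    // /etc/hosts, the local hostname resolves to 127.0.1.1 -- a
    // loopback-range address that other machines cannot reach,
    // and that nothing may actually be listening on locally.
    val local = InetAddress.getLocalHost
    println(s"hostname = ${local.getHostName}")
    println(s"address  = ${local.getHostAddress}")
  }
}
```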

Answer


I found that this problem is related to this question. However, in my case, setting SPARK_LOCAL_IP on the workers did not solve the problem. I had to change /etc/hosts to:

127.0.0.1  localhost 

and now it runs smoothly.