
I have a Spark cluster (7 workers × 2 cores each) running Spark 2.0.2, set up alongside an HDFS cluster. The Spark executors cannot connect to a mysterious port, 35529.

When I read some HDFS files from Jupyter, I see the application start with 14 cores and the 3 executors I requested, but none of the workers manages to launch any task: every executor fails with a network error, unable to connect to a strange "localhost" port, 35529.

from pyspark.sql import SparkSession

# Build the session, requesting 3 executor instances
spark = SparkSession.builder.master(master).appName(appName) \
    .config("spark.executor.instances", 3).getOrCreate()
sc = spark.sparkContext

# Read one hour of logs from HDFS
hdfs_master = "hdfs://xx.xx.xx.xx:8020"
hdfs_path = "/logs/cycliste_debug/2017/2017_02/2017_02_20/23h/*"
infos = sc.textFile(hdfs_master + hdfs_path)

This is what I see in the Spark master UI (screenshot not reproduced here):

(This strikes me as odd: 14 cores get allocated when only 3 × 2 should be possible, i.e. spark.executor.instances × the number of cores per node.)
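(For what it's worth: on a standalone master, spark.executor.instances is mainly honored by YARN, and by default a standalone application grabs every available core, which would explain the 7 × 2 = 14. The usual way to cap it there is spark.cores.max; a minimal sketch, where master and appName are the same placeholders as above:

from pyspark.sql import SparkSession

# Sketch: cap the application at 6 cores total on a standalone master,
# which with 2-core workers yields at most 3 executors.
spark = (SparkSession.builder
         .master(master)
         .appName(appName)
         .config("spark.cores.max", 6)
         .getOrCreate())
)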

Here is the cluster summary:

Executor summary for app-20170227140938-0009:

ExecutorID Worker Cores Memory State Logs 
1488 worker-20170227125912-xx.xx.xx.xx-38028 2 1024 RUNNING stdout stderr 
1489 worker-20170227125954-xx.xx.xx.xx-48962 2 1024 RUNNING stdout stderr 
5  worker-20170227125959-xx.xx.xx.xx-48149 2 1024 RUNNING stdout stderr 
1486 worker-20170227130012-xx.xx.xx.xx-47639 2 1024 RUNNING stdout stderr 
1490 worker-20170227130027-xx.xx.xx.xx-44921 2 1024 RUNNING stdout stderr 
1485 worker-20170227130152-xx.xx.xx.xx-50620 2 1024 RUNNING stdout stderr 
1487 worker-20170227130248-xx.xx.xx.xx-42100 2 1024 RUNNING stdout stderr 

And here is an example of the error from one of the workers:

stderr log page for app-20170227140938-0009/1488:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
17/02/27 14:37:57 INFO CoarseGrainedExecutorBackend: Started daemon with process name: [email protected] 
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for TERM 
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for HUP 
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for INT 
17/02/27 14:37:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
17/02/27 14:37:58 INFO SecurityManager: Changing view acls to: spark 
17/02/27 14:37:58 INFO SecurityManager: Changing modify acls to: spark 
17/02/27 14:37:58 INFO SecurityManager: Changing view acls groups to: 
17/02/27 14:37:58 INFO SecurityManager: Changing modify acls groups to: 
17/02/27 14:37:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set() 
17/02/27 14:38:01 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy? 
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713) 
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70) 
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:174) 
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:270) 
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) 
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult 
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) 
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) 
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) 
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) 
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) 
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) 
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88) 
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:188) 
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:71) 
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) 
    ... 4 more 
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:35529 
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) 
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) 
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197) 
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191) 
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187) 
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:35529 
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) 
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) 
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) 
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) 
    ... 1 more 

I understand this is simply a communication problem between two processes: the executor is apparently trying to reach the driver's RPC endpoint, which has been advertised to it as localhost:35529.

So here is my /etc/hosts file:

127.0.0.1 localhost 
193.xx.xx.xxx vpsxxxx.ovh.net vpsxxxx 
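
One quick way to check which address Spark will pick up by default is to look at what the machine's own hostname resolves to (a small diagnostic sketch; run on the driver node):

import socket

# Sketch: if this prints 127.0.0.1, Spark can end up advertising
# "localhost" as the driver address to remote executors.
print(socket.gethostname())
print(socket.gethostbyname(socket.gethostname()))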

Any ideas?

Answer


Check that SPARK_LOCAL_IP is set to the correct IP on each slave.
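That means exporting SPARK_LOCAL_IP in conf/spark-env.sh on each machine with that machine's routable address. The equivalent driver-side fix can also be applied from PySpark by pinning the advertised driver host explicitly; a sketch, where the IP is a placeholder for the driver machine's public address:

from pyspark.sql import SparkSession

# Sketch: advertise an explicit, routable driver address so executors
# don't receive "localhost". The IP below is a placeholder.
spark = (SparkSession.builder
         .master("spark://193.xx.xx.xxx:7077")
         .appName("fix-driver-host")
         .config("spark.driver.host", "193.xx.xx.xxx")
         .getOrCreate())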
