Can't connect to a Spark cluster on EC2 using pyspark

2015-09-16 93 views
1

I followed the instructions on the Spark website and have 1 master and 1 slave running on Amazon EC2. However, I can't connect to the master node with pyspark, even though I can SSH into the master node without any problem. The command I launched the cluster with was:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a launch graph-cluster

I can go to http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:8080/ and see that Spark is up and running. I can also see the Spark master at:

spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077

However, when I run the command

MASTER=spark://ec2-54-152-xx-xx.compute-1.amazonaws.com:7077 pyspark 

I get this error:

2015-09-16 15:39:31,800 ERROR actor.OneForOneStrategy (Slf4jLogger.scala:apply$mcV$sp(66)) - 
java.lang.NullPointerException 
    at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160) 
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) 
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) 
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) 
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) 
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) 
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) 
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) 
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465) 
    at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61) 
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) 
    at akka.actor.ActorCell.invoke(ActorCell.scala:487) 
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) 
    at akka.dispatch.Mailbox.run(Mailbox.scala:220) 
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) 
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
2015-09-16 15:39:31,804 INFO client.AppClient$ClientActor (Logging.scala:logInfo(59)) - Connecting to master akka.tcp://sparkMaster@ec2-54-152-xx-xx.compute-1.amazonaws.com:7077/user/Master... 
2015-09-16 15:39:31,955 INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52333. 
2015-09-16 15:39:31,956 INFO netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server created on 52333 
2015-09-16 15:39:31,959 INFO storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register BlockManager 
2015-09-16 15:39:31,964 INFO storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(59)) - Registering block manager xxx:52333 with 265.1 MB RAM, BlockManagerId(driver, xxx, 52333) 
2015-09-16 15:39:31,969 INFO storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager 
2015-09-16 15:39:32,458 ERROR spark.SparkContext (Logging.scala:logError(96)) - Error initializing SparkContext. 
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext 
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103) 
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503) 
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007) 
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543) 
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) 
    at py4j.Gateway.invoke(Gateway.java:214) 
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) 
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) 
    at py4j.GatewayConnection.run(GatewayConnection.java:207) 
    at java.lang.Thread.run(Thread.java:745) 
2015-09-16 15:39:32,460 INFO spark.SparkContext (Logging.scala:logInfo(59)) - SparkContext already stopped. 
Traceback (most recent call last): 
    File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/shell.py", line 43, in <module> 
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files) 
    File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 113, in __init__ 
    conf, jsc, profiler_cls) 
    File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 165, in _do_init 
    self._jsc = jsc or self._initialize_context(self._conf._jconf) 
    File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 219, in _initialize_context 
    return self._jvm.JavaSparkContext(jconf) 
    File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__ 
    File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. 
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext 
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103) 
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503) 
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007) 
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543) 
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) 
    at py4j.Gateway.invoke(Gateway.java:214) 
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) 
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) 
    at py4j.GatewayConnection.run(GatewayConnection.java:207) 
    at java.lang.Thread.run(Thread.java:745) 

Answer

3

spark-ec2 does not open port 7077 on the master node to incoming connections from outside the cluster.

You can verify this in the AWS console under EC2 / Network & Security / Security Groups: check the Inbound tab of the graph-cluster-master security group.

You can add a rule there to open port 7077 to inbound connections.

However, it is recommended to run pyspark (which is essentially Spark's application driver) from the master machine of the EC2 cluster, and to avoid running the driver outside the network. The reasons are increased latency and firewall setup problems: you would need to open several more ports so that the executors can connect back to the driver on your machine.
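If you do decide to open the port, here is a minimal sketch using the AWS CLI. It assumes spark-ec2 created a security group named graph-cluster-master (its usual <cluster-name>-master naming) and uses 203.0.113.7 as a stand-in for your machine's public IP; adjust both to your setup:

# Hypothetical values: allow inbound TCP on 7077 from a single IP
aws ec2 authorize-security-group-ingress \
  --group-name graph-cluster-master \
  --protocol tcp \
  --port 7077 \
  --cidr 203.0.113.7/32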

So the way to go is to SSH into the cluster with this command:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a login graph-cluster 

and run pyspark from the master:

cd spark 
bin/pyspark 
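If you want to point the shell at the master explicitly, the same MASTER variable from the question should also work when run on the master itself:

MASTER=spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077 bin/pyspark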

You'll need to transfer the relevant files (your scripts and data) to the master. I usually keep data on S3 and edit script files with vim, or start an IPython notebook.
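For example (a sketch with placeholder file and bucket names, reusing the key and host from this question; spark-ec2 AMIs typically log in as root):

# Copy a local script up to the master
scp -i /Users/.ssh/pem.pem my_script.py root@ec2-54-152-xx-xxx.compute-1.amazonaws.com:~/

# Stage data in S3 so the cluster can read it (e.g. via an s3n:// path in Spark 1.x)
aws s3 cp data.csv s3://my-bucket/data.csv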

By the way, the latter is pretty easy: you need to add a rule to the master's security group in the EC2 console allowing inbound connections from your machine's IP to port 18888. Then run this command on the cluster:

IPYTHON_OPTS="notebook --pylab inline --port=18888 --ip='*'" pyspark

Then you can access the notebook at http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:18888/