
Can't get Spark to work in IPython Notebook on Windows

I have installed Spark on Windows 10, and the PySpark console works fine. Recently, however, I tried to configure IPython Notebook to use my Spark installation, and added the following imports and path settings:

import os
import sys

# Point PySpark at the local Spark installation and put its Python packages on the path.
os.environ['SPARK_HOME'] = "E:/Spark/spark-1.6.0-bin-hadoop2.6"
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/bin")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/pyspark")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("E:/Spark/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
sys.path.append("C:/Program Files/Java/jdk1.8.0_51/bin")

Creating a SparkContext also works fine, as does code like:

sc.parallelize([1, 2, 3]) 
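
For reference, here is a minimal sketch of how the SparkContext used above might be created in the notebook; the question does not show this step, and the master and app name below are placeholder assumptions:

# Assumed sketch: create a local SparkContext once the paths above are set up.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("notebook-test")
sc = SparkContext(conf=conf)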

But when I run the following:

file = sc.textFile("E:/scripts.sql") 
words = sc.count() 

I get the following error:

Py4JJavaError Traceback (most recent call last) 
<ipython-input-22-3c172daac960> in <module>() 
1 file = sc.textFile("E:/scripts.sql") 
----> 2 file.count() 

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in count(self) 
1002   3 
1003   """ 
-> 1004   return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 
1005 
1006  def stats(self): 

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in sum(self) 
993   6.0 
994   """ 
--> 995   return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) 
996 
997  def count(self): 

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in fold(self, zeroValue, op) 
867   # zeroValue provided to each partition is unique from the one provided 
868   # to the final reduce call 
--> 869   vals = self.mapPartitions(func).collect() 
870   return reduce(op, vals, zeroValue) 
871 

E:/Spark/spark-1.6.0-bin-hadoop2.6/python\pyspark\rdd.py in collect(self) 
769   """ 
770   with SCCallSiteSync(self.context) as css: 
--> 771    port =  self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 
772   return list(_load_from_socket(port, self._jrdd_deserializer)) 
773 

E:\Spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py in __call__(self, *args) 
811   answer = self.gateway_client.send_command(command) 
812   return_value = get_return_value(
--> 813    answer, self.gateway_client, self.target_id, self.name) 
814 
815   for temp_arg in temp_args: 

E:\Spark\spark-1.6.0-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 
306     raise Py4JJavaError(
307      "An error occurred while calling {0}{1}{2}.\n". 
--> 308      format(target_id, ".", name), value) 
309    else: 
 310     raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8, localhost): org.apache.spark.SparkException: Python worker did not connect back in time at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136) 
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65) 
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134) 
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
at org.apache.spark.scheduler.Task.run(Task.scala:89) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
at java.lang.Thread.run(Unknown Source) 
Caused by: java.net.SocketTimeoutException: Accept timed out 
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method) 
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source) 
at java.net.AbstractPlainSocketImpl.accept(Unknown Source) 
at java.net.PlainSocketImpl.accept(Unknown Source) 
at java.net.ServerSocket.implAccept(Unknown Source) 
at java.net.ServerSocket.accept(Unknown Source) 
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131) 
... 12 more 

Driver stacktrace: 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) 
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
at scala.Option.foreach(Option.scala:236) 
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) 
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) 
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) 
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) 
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) 
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) 
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) 
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) 
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) 
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
at org.apache.spark.rdd.RDD.collect(RDD.scala:926) 
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405) 
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) 
at java.lang.reflect.Method.invoke(Unknown Source) 
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
at py4j.Gateway.invoke(Gateway.java:259) 
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
at py4j.commands.CallCommand.execute(CallCommand.java:79) 
at py4j.GatewayConnection.run(GatewayConnection.java:209) 
at java.lang.Thread.run(Unknown Source) 
Caused by: org.apache.spark.SparkException: Python worker did not connect back in time 
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:136) 
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65) 
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134) 
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
at org.apache.spark.scheduler.Task.run(Task.scala:89) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
... 1 more 
Caused by: java.net.SocketTimeoutException: Accept timed out 
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method) 
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source) 
at java.net.AbstractPlainSocketImpl.accept(Unknown Source) 
at java.net.PlainSocketImpl.accept(Unknown Source) 
at java.net.ServerSocket.implAccept(Unknown Source) 
at java.net.ServerSocket.accept(Unknown Source) 
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:131) 
... 12 more 

Please help me resolve this, as I am on a project with a tight deadline.

Answers


Try escaping the backslashes:

file = sc.textFile("E:\\scripts.sql") 
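
An equivalent option is a Python raw string, which avoids doubling the backslashes:

# Raw string: backslashes are not treated as escape characters.
file = sc.textFile(r"E:\scripts.sql")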

Edited to add a second thing to look at:

Also, I noticed you called:

words = sc.count() 

Try this instead; it works on my Windows 10 installation:

file = sc.textFile("E:/scripts.sql") 
words = file.count() 

Already tried that, but to no avail. I think the relevant part is the error message 'Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException: Python worker did not connect back in time'


@Maurer Still no way forward. I updated IPython Notebook to Jupyter, but to no avail. The error still begins with 'Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.'
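
One commonly suggested check for the 'Python worker did not connect back in time' / 'Accept timed out' failure on Windows is to make sure the executors can launch the same Python interpreter the notebook is running. A minimal sketch, which must run before the SparkContext is created (the explicit path in the comment is only a placeholder assumption):

import os
import sys

# Assumption: pin the worker interpreter to the driver's Python so executors can start it.
os.environ["PYSPARK_PYTHON"] = sys.executable  # or an explicit path such as "C:/Python27/python.exe"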