I have a PySpark program that depends on the numpy library under the hood. numpy is not installed on the worker nodes, and I do not have permission to install it there. When I run spark-shell, I use '--py-files' to ship the numpy library to the worker nodes at runtime. However, I get the following error message:
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 232.0 failed 4 times, most recent failure: Lost task 0.3 in stage 232.0 (TID 60801, anp-r01wn02.c03.hadoop.td.com): org.apache.spark.SparkException:
Error from python worker:
/usr/bin/python: No module named mtrand
PYTHONPATH was:
/usr/lib/spark/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar:/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python/::/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python/lib/pyspark.zip:/data/10/yarn/nm/usercache/zakerh2/appcache/application_1462889699566_2857/container_e37_1462889699566_2857_01_000332/numpy.zip
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
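For reference, the approach I am using looks roughly like this (the paths and script name below are placeholders, not my actual job):

```shell
# Zip up a locally available copy of the numpy package so it can be
# shipped to the executors. The source path is illustrative.
cd /path/to/local/site-packages
zip -r /tmp/numpy.zip numpy

# Ship the zip to the worker nodes at submit time; Spark adds it to
# PYTHONPATH on the executors (numpy.zip is visible in the PYTHONPATH
# printed in the error above).
spark-submit \
  --master yarn \
  --py-files /tmp/numpy.zip \
  my_job.py
```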
What is the problem here? Is it caused by another dependency inside numpy? How can I fix it?
Are there any other options for installing or shipping numpy to the worker nodes? I have seen some approaches that install Python packages with pip at runtime, but I am not sure how that would work with PySpark. Any ideas or comments?
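The "pip at runtime" idea I have seen sketched elsewhere looks roughly like the following; this is a hypothetical sketch, and whether `pip install --user` is even permitted on our worker nodes is an assumption I have not verified:

```python
import subprocess
import sys


def ensure_numpy():
    """Try to import numpy; if it is missing, attempt a per-user pip install.

    This assumes the worker's Python has pip available and that a --user
    install is allowed, neither of which is guaranteed on a locked-down
    cluster.
    """
    try:
        import numpy  # noqa: F401
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", "numpy"]
        )
```

The idea would be to call `ensure_numpy()` at the top of each task (for example, inside a `mapPartitions` function) so every executor bootstraps its own copy, but I do not know if that is a sane pattern in PySpark.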