
Running a Windows batch file through pipe in Apache Spark

I have a requirement where I have to run a Windows batch file on multiple nodes of a Spark cluster using Apache Spark.

So is it possible to do the same thing using Apache Spark's pipe concept?

I have already run a shell file through Spark's pipe on an Ubuntu machine. My code below runs fine there:

from pyspark import SparkFiles

data = ["hi", "hello", "how", "are", "you"]
distScript = "/home/aawasthi/echo.sh"
distScriptName = "echo.sh"
sc.addFile(distScript)            # ship the script to every node
RDDdata = sc.parallelize(data)
print RDDdata.pipe(SparkFiles.get(distScriptName)).collect()
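For context, RDD.pipe() launches the given command once per partition, feeds the partition's elements to the process's stdin one per line, and turns the process's stdout lines into the new RDD. A simplified sketch of the per-partition behaviour (based on the shlex.split/Popen call visible in the traceback further down; the real implementation in pyspark/rdd.py streams instead of buffering everything):

import shlex
from subprocess import Popen, PIPE

def pipe_partition(lines, command):
    # Split the command string into argv, start the process, and
    # exchange the partition's lines over stdin/stdout.
    proc = Popen(shlex.split(command), stdin=PIPE, stdout=PIPE)
    out, _ = proc.communicate("\n".join(lines) + "\n")
    return out.splitlines()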

I tried to adapt the same code to run a Windows batch file on a Windows machine with Spark installed (1.6, pre-built for Hadoop 2.6). But it gives me an error at the sc.addFile step. The code is below:

batchFile = "D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv" 
batchFileName = "runOpenCv" 
sc.addFile(batchFile) 

The error Spark throws is below:

Py4JJavaError        Traceback (most recent call last) 
<ipython-input-11-9e13c265cbae> in <module>() 
----> 1 sc.addFile(batchFile) 

Py4JJavaError: An error occurred while calling o160.addFile. 
: java.io.FileNotFoundException: Added file D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv does not exist. 
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1364) 
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:209) 
    at java.lang.Thread.run(Thread.java:745) 

The batch file does exist at the specified location, though.
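A quick driver-side check against the exact string passed to addFile can confirm whether the path resolves (a sketch; note that the path above carries no file extension):

import os.path

# Hypothetical sanity check: addFile needs the literal on-disk file name,
# extension included.
print os.path.exists("D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv")      # False if only runOpenCv.bat exists
print os.path.exists("D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv.bat")  # True in that case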

UPDATE
Added .bat as the extension in batchFile & batchFileName, and file:/// at the start of the file path. The modified code is:

from pyspark import SparkFiles
from pyspark import SparkContext

sc  # SparkContext already created by the PySpark shell/notebook
batchFile = "file:///D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv.bat"
batchFileName = "runOpenCv.bat"
sc.addFile(batchFile)
RDDdata = sc.parallelize(["hi", "hello"])
print SparkFiles.get("runOpenCv.bat")
print RDDdata.pipe(SparkFiles.get(batchFileName)).collect()

Now it doesn't give an error at the addFile step, and print SparkFiles.get("runOpenCv.bat") prints the path
C:\Users\abhilash.awasthi\AppData\Local\Temp\spark-c0f383b1-8365-4840-bd0f-e7eb46cc6794\userFiles-69051066-f18c-45dc-9610-59cbde0d77fe\runOpenCv.bat
so the file is added. But the last step of the code throws the error below:

Py4JJavaError        Traceback (most recent call last) 
<ipython-input-6-bf2b8aea3ef0> in <module>() 
----> 1 print RDDdata.pipe(SparkFiles.get(batchFileName)).collect() 

D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.pyc in collect(self) 
    769   """ 
    770   with SCCallSiteSync(self.context) as css: 
--> 771    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 
    772   return list(_load_from_socket(port, self._jrdd_deserializer)) 
    773 

D:\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py in __call__(self, *args) 
    811   answer = self.gateway_client.send_command(command) 
    812   return_value = get_return_value(
--> 813    answer, self.gateway_client, self.target_id, self.name) 
    814 
    815   for temp_arg in temp_args: 

D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\utils.pyc in deco(*a, **kw) 
    43  def deco(*a, **kw): 
    44   try: 
---> 45    return f(*a, **kw) 
    46   except py4j.protocol.Py4JJavaError as e: 
    47    s = e.java_exception.toString() 

D:\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 
    306     raise Py4JJavaError(
    307      "An error occurred while calling {0}{1}{2}.\n". 
--> 308      format(target_id, ".", name), value) 
    309    else: 
    310     raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 317, in func 
    return f(iterator) 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 715, in func 
    shlex.split(command), env=env, stdin=PIPE, stdout=PIPE) 
    File "C:\Anaconda2\lib\subprocess.py", line 710, in __init__ 
    errread, errwrite) 
    File "C:\Anaconda2\lib\subprocess.py", line 958, in _execute_child 
    startupinfo) 
WindowsError: [Error 2] The system cannot find the file specified 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

Driver stacktrace: 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) 
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at scala.Option.foreach(Option.scala:236) 
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) 
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926) 
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405) 
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:209) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 317, in func 
    return f(iterator) 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 715, in func 
    shlex.split(command), env=env, stdin=PIPE, stdout=PIPE) 
    File "C:\Anaconda2\lib\subprocess.py", line 710, in __init__ 
    errread, errwrite) 
    File "C:\Anaconda2\lib\subprocess.py", line 958, in _execute_child 
    startupinfo) 
WindowsError: [Error 2] The system cannot find the file specified 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    ... 1 more 

Windows batch files have a '.cmd' or '.bat' extension. Have you tried including it?


@MCND Oh, my silly mistake.. yes, the extension should be there. After adding '.bat' to 'batchFile' & 'batchFileName', I no longer get the file-does-not-exist error. But I get a different error, as shown in the updated question.


'No FileSystem for scheme: D', so 'D:' is not being handled as needed. Maybe (sorry if this is silly; I know something about batch files, but Java is not my area) you need a URI, like 'file:///D:/...'.

Answer


Please escape the /:

batchFile = "D://spark-1.6.2-bin-hadoop2.6//data//OpenCV//runOpenCv"

Also, as suggested above, it may have a .cmd or .bat extension.


The escape character is \, so there is no need to escape '/'.
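Following up on that comment: the Python part of the traceback shows that pipe() runs the command through shlex.split (pyspark/rdd.py, line 715) before handing it to subprocess.Popen. shlex.split treats \ as an escape character, so the backslashes in the Windows temp path returned by SparkFiles.get() are likely stripped before Popen ever sees it, which would explain the "system cannot find the file specified" error. A hedged workaround sketch (an assumption, not a confirmed fix):

from pyspark import SparkFiles

# Assumption: shlex.split is eating the backslashes in the temp path.
# Double-quoting makes shlex.split keep the backslashes literally:
cmd = '"%s"' % SparkFiles.get(batchFileName)
print RDDdata.pipe(cmd).collect()

# Alternative: forward slashes, which the Windows CreateProcess API
# generally accepts in executable paths:
cmd = SparkFiles.get(batchFileName).replace("\\", "/")
print RDDdata.pipe(cmd).collect()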