Pyspark action submitted through Oozie fails: '[Errno 2] No such file or directory'

I am trying to run a basic Spark action on YARN on a Hadoop cluster through an Oozie workflow, and I am getting the following error (from the YARN application logs):
>>> Invoking Spark class now >>>
python: can't open file '/absolute/local/path/to/script.py': [Errno 2] No such file or directory
Hadoop Job IDs executed by Spark:
Intercepting System.exit(2)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]
But I am sure the file is there. In fact, when I run the following command:
spark-submit --master yarn --deploy-mode client /absolute/local/path/to/script.py arg1 arg2
it works, and I get the output I want.
Note: I followed everything in this article to get it set up (I am using Spark2): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/ch_oozie-spark-action.html
Any ideas?
workflow.xml (simplified for clarity):
<action name="action1">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>${sparkMaster}</master>
        <mode>${sparkMode}</mode>
        <name>action1</name>
        <jar>${integrate_script}</jar>
        <arg>arg1</arg>
        <arg>arg2</arg>
    </spark>
    <ok to="end"/>
    <error to="kill_job"/>
</action>
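For reference, as far as I understand the Spark action, its elements are meant to map onto the flags of the spark-submit command that works for me from the shell. An annotated copy of the relevant part (the comments are mine, added for illustration; they are not in the real workflow):

<spark xmlns="uri:oozie:spark-action:0.1">
    <master>${sparkMaster}</master>    <!-- equivalent to: --master yarn -->
    <mode>${sparkMode}</mode>          <!-- equivalent to: --deploy-mode client -->
    <name>action1</name>               <!-- application name shown in YARN -->
    <jar>${integrate_script}</jar>     <!-- the application file; for PySpark this is the .py script -->
    <arg>arg1</arg>                    <!-- positional arguments, in order -->
    <arg>arg2</arg>
</spark>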
job.properties (simplified for clarity), when running in CLUSTER mode:
oozie.wf.application.path=${nameNode}/user/${user.name}/${user.name}/${zone}
oozie.use.system.libpath=true
nameNode=hdfs://myNameNode:8020
jobTracker=myJobTracker:8050
oozie.action.sharelib.for.spark=spark2
sparkMaster=yarn
sparkMode=client
integrate_script=/absolute/local/path/to/script.py
zone=somethingUsefulForMe
Exception:
diagnostics: Application application_1502381591395_1000 failed 2 times due to AM Container for appattempt_1502381591395_1000_000002 exited with exitCode: -1000
For more detailed output, check the application tracking page: http://hostname:port/cluster/app/application_1502381591395_1000 Then click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
java.io.FileNotFoundException: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
EDIT 2:
I just tried it from the shell, and it fails because of an import. The file layout is:
/scripts/functions/tools.py
/scripts/functions/__init__.py
/scripts/myScript.py
and myScript.py contains:

from functions.tools import *
And that is what fails. I assume the script is first copied to the cluster and run there, so how do I get all of the required modules shipped along with it? Modify the PYTHONPATH on HDFS? I understand why it is not working; I just do not know how to fix it.
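One approach I am considering (a sketch only; the HDFS path and zip name below are placeholders, not my real setup) is to zip the package with zip -r functions.zip functions/, upload it with hdfs dfs -put functions.zip /user/myUser/apps/, and hand it to Spark through the action's <spark-opts> element, since --py-files places the listed archives on the PYTHONPATH of the driver and executors:

<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>${sparkMaster}</master>
    <mode>${sparkMode}</mode>
    <name>action1</name>
    <jar>${integrate_script}</jar>
    <!-- ship the zipped package so "from functions.tools import *" resolves on the cluster;
         path is a placeholder for wherever the zip was uploaded -->
    <spark-opts>--py-files ${nameNode}/user/myUser/apps/functions.zip</spark-opts>
    <arg>arg1</arg>
    <arg>arg2</arg>
</spark>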
EDIT 3:
See the stack trace below. Most comments online say the problem is the Python code setting the master to "local". That is not the case here. Moreover, I even removed everything Spark-related from the Python script and I still get the same problem.
Diagnostics: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
java.io.FileNotFoundException: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Is '/absolute/path/to/script.py' a local filesystem path or an HDFS path? – Mariusz
Good point. It is local. Initially I tried an HDFS path and got a very explicit error that the script must be local. Edited to avoid confusion. – Tiberiu