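Spark RDD method "saveAsTextFile" throws an exception even after deleting the output directory: org.apache.hadoop.mapred.FileAlreadyExistsException
I am calling this method (in Scala) on an RDD[String], passing the destination as the argument.
The process fails with this error even though I delete the directory before starting. I am running it on an EMR cluster, with the output location on AWS S3. Below is the command used: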
spark-submit --deploy-mode cluster --class com.hotwire.hda.spark.prd.pricingengine.PRDPricingEngine --conf spark.yarn.submit.waitAppCompletion=true --num-executors 21 --executor-cores 4 --executor-memory 20g --driver-memory 8g --driver-cores 4 s3://bi-aws-users/sbatheja/hotel-shopper-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d 3 -p 100 --search-bucket s3a://hda-prod-business.hotwire.hotel.search --prd-output-path s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/
Logs:
16/07/07 11:27:47 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/07 11:27:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/07 11:27:47 INFO SparkContext: Successfully stopped SparkContext
16/07/07 11:27:47 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: **org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput already exists)**
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/07 11:27:47 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/07/07 11:27:47 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1467889642439_0001
16/07/07 11:27:47 INFO ShutdownHookManager: Shutdown hook called
16/07/07 11:27:47 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1467889642439_0001/spark-7f836950-a040-4216-9308-2bb4565c5649
It does create the destination, which contains a "_temporary" directory with empty part files.
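Are you sure the folder doesn't exist before you run the job? And why are you using 's3a' rather than 's3' or 's3n'? –
Yes, I deleted the directory before everything else. The main reason is that s3 supports files only up to 5 GB, while s3a has no such limit. I also tried s3. Same problem :( – saurabh7389
Maybe your code is failing somewhere else, which would explain the temporary files, and you have some retry mechanism that runs the code again; the second run then fails because the directory already exists from the previous attempt, and you are missing the original error? –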
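One way to check the retry hypothesis from the comment above: in YARN cluster mode a failed application can be relaunched, and the number of attempts is controlled by `spark.yarn.maxAppAttempts` (bounded by YARN's `yarn.resourcemanager.am.max-attempts`, which defaults to 2). The sketch below is the same command as in the question with that one setting added; it assumes the job really is being retried, which nothing in this thread confirms. With a single attempt, the log should end with the first failure rather than with the FileAlreadyExistsException from a second run.

```shell
# Same spark-submit as in the question, plus one extra --conf that limits
# the application to a single attempt (assumption: a YARN retry is what
# hides the original error).
spark-submit --deploy-mode cluster \
  --class com.hotwire.hda.spark.prd.pricingengine.PRDPricingEngine \
  --conf spark.yarn.submit.waitAppCompletion=true \
  --conf spark.yarn.maxAppAttempts=1 \
  --num-executors 21 --executor-cores 4 --executor-memory 20g \
  --driver-memory 8g --driver-cores 4 \
  s3://bi-aws-users/sbatheja/hotel-shopper-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  -d 3 -p 100 \
  --search-bucket s3a://hda-prod-business.hotwire.hotel.search \
  --prd-output-path s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/
```

If the first attempt then reports a different exception, that is the one to fix; the output directory and its "_temporary" contents would simply be leftovers from that failed attempt.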