2017-01-24

I am trying to follow this AWS tutorial, using the quick-example steps: https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/ The output file is not being saved to my bucket on AWS S3.

When I try to run this command:

aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE

my output file does not appear in my bucket, even though EMR says the step completed:

SparkWordCountApp Completed 2017-01-24 16:35 (UTC+1) 10 seconds 

This is the word-count Python file:

from __future__ import print_function
from pyspark import SparkContext
import sys

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: wordcount <input> <output>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="WordCount")
    text_file = sc.textFile(sys.argv[1])
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()
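As an aside, the flatMap/map/reduceByKey pipeline above is a standard distributed word count, and its logic can be sanity-checked locally with plain Python before submitting to the cluster (the sample lines below are made up for illustration):

```python
from collections import Counter

lines = ["hello world", "hello spark"]  # stand-in for the S3 input file

# flatMap: split every line into individual words
words = [w for line in lines for w in line.split(" ")]

# map + reduceByKey: emit (word, 1) per word, then sum the counts per key
counts = Counter(words)

print(sorted(counts.items()))
# → [('hello', 2), ('spark', 1), ('world', 1)]
```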

This is from the cluster log files:

17/01/25 14:40:19 INFO Client: Requesting a new application from cluster with 2 NodeManagers 
17/01/25 14:40:19 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container) 
Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (20480+2048 MB) is above the max threshold (11520 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'. 
    at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:304) 
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:164) 
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1119) 
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1178) 
    at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736) 
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) 
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) 
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) 
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Command exiting with ret '1' 
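The numbers in the exception line up with Spark's default executor memory overhead on YARN, which in this Spark version is max(10% of executor memory, 384 MB). A quick check of the arithmetic, using the values from the log above:

```python
# Values taken from the log output.
executor_memory_mb = 20 * 1024   # --executor-memory 20g
yarn_max_allocation_mb = 11520   # yarn.scheduler.maximum-allocation-mb

# Default off-heap overhead: 10% of executor memory, but at least 384 MB.
overhead_mb = max(384, int(0.10 * executor_memory_mb))
total_mb = executor_memory_mb + overhead_mb

print(overhead_mb, total_mb)               # → 2048 22528
print(total_mb > yarn_max_allocation_mb)   # → True, so YARN rejects the request
```

This reproduces the "(20480+2048 MB)" figure in the exception, and shows why the request can never be satisfied on a node whose maximum container size is 11520 MB.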

I am using m3.xlarge instances.

What is the value of 'spark.executor.memory'? – franklinsijo

Judging from the command line, it is 20g. –

Yes, you had already mentioned that; I missed it. Each m3.xlarge instance has only 15 GB, but each executor requests 20g plus 2g of overhead, and the YARN configuration only allows a maximum of 11.5 GB. Can you reduce it to 8g and try running it again? – franklinsijo
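Following that suggestion, the step could be resubmitted with the executor memory reduced to 8g, so that 8192 MB plus the ~10% overhead fits under the 11520 MB cap. A sketch of the adjusted command, reusing the cluster ID and bucket names from the question (the wordcount-output/ subdirectory is an illustrative name, not something from the original command; note also that with only 2 NodeManagers, YARN may grant fewer than the 5 requested executors):

```shell
aws emr add-steps --cluster-id j-xxxxx \
  --steps Type=spark,Name=SparkWordCountApp,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,8g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/wordcount-output/]
```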

Answer

Try making the output directory a subdirectory rather than the root of the bucket. Without speaking for the EMR S3 client, I know Hadoop S3A has had trouble in the past with rename() against the root of a bucket. Otherwise, turn up the logging and see what gets printed from the com.aws modules.

I have added the log file to my question. –
