2015-05-05

How do I submit an application to a YARN cluster so that the jars from `--packages` are also copied?

I am trying to submit a Spark job with the spark-csv package specified as a dependency:

spark/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --deploy-mode cluster --master yarn-cluster script.py 

but I get the following exception (snippet):

15/05/05 22:23:46 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar 
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://172.31.13.205:9000/home/hadoop/.ivy2/jars/spark-csv_2.10.jar 

The Spark cluster was installed and configured with the following script:

aws emr create-cluster --name sandbox --ami-version 3.6 --instance-type m3.xlarge --instance-count 3 \ 
    --ec2-attributes KeyName=sandbox \ 
    --applications Name=Hive \ 
    --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \ 
    --log-uri s3://mybucket/spark-logs \ 
    --steps \ 
    Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server \ 
    Name=SparkConfigure,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://support.elasticmapreduce/spark/configure-spark.bash,spark.default.parallelism=100,spark.locality.wait.rack=0] 

This should be broadly relevant to Spark developers, since I imagine that using EMR with Spark is not an uncommon workflow, and I am not doing anything particularly complicated.

Here is the extended stack trace:

Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Ivy Default Cache set to: /home/hadoop/.ivy2/cache 
The jars for the packages stored in: /home/hadoop/.ivy2/jars 
:: loading settings :: url = jar:file:/home/hadoop/.versions/spark-1.3.0.d/lib/spark-assembly-1.3.0-hadoop2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml 
com.databricks#spark-csv_2.10 added as a dependency 
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 
    confs: [default] 
    found com.databricks#spark-csv_2.10;1.0.3 in central 
    found org.apache.commons#commons-csv;1.1 in central 
:: resolution report :: resolve 238ms :: artifacts dl 8ms 
    :: modules in use: 
    com.databricks#spark-csv_2.10;1.0.3 from central in [default] 
    org.apache.commons#commons-csv;1.1 from central in [default] 
    --------------------------------------------------------------------- 
    |     |   modules   || artifacts | 
    |  conf  | number| search|dwnlded|evicted|| number|dwnlded| 
    --------------------------------------------------------------------- 
    |  default  | 2 | 0 | 0 | 0 || 2 | 0 | 
    --------------------------------------------------------------------- 
:: retrieving :: org.apache.spark#spark-submit-parent 
    confs: [default] 
    0 artifacts copied, 2 already retrieved (0kB/10ms) 
15/05/05 22:07:23 INFO client.RMProxy: Connecting to ResourceManager at /172.31.13.205:9022 
15/05/05 22:07:23 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers 
15/05/05 22:07:23 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container) 
15/05/05 22:07:23 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 
15/05/05 22:07:23 INFO yarn.Client: Setting up container launch context for our AM 
15/05/05 22:07:23 INFO yarn.Client: Preparing resources for our AM container 
15/05/05 22:07:24 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.0.d/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://172.31.13.205:9000/user/hadoop/.sparkStaging/application_1430862769169_0005/spark-assembly-1.3.0-hadoop2.4.0.jar 
15/05/05 22:07:24 INFO metrics.MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false 
15/05/05 22:07:24 INFO metrics.MetricsSaver: Created MetricsSaver j-3C91V87M8TXWD:i-e4bd8f2d:SparkSubmit:05979 period:60 /mnt/var/em/raw/i-e4bd8f2d_20150505_SparkSubmit_05979_raw.bin 
15/05/05 22:07:25 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar 
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://172.31.13.205:9000/home/hadoop/.ivy2/jars/spark-csv_2.10.jar 
    at org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:129) 
    at org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:460) 
    at org.apache.hadoop.fs.FileContext$23.next(FileContext.java:2120) 
    at org.apache.hadoop.fs.FileContext$23.next(FileContext.java:2116) 
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) 
    at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2116) 
    at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:591) 
    at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:203) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4$$anonfun$apply$1.apply(Client.scala:285) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4$$anonfun$apply$1.apply(Client.scala:280) 
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4.apply(Client.scala:280) 
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4.apply(Client.scala:278) 
    at scala.collection.immutable.List.foreach(List.scala:318) 
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:278) 
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:384) 
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:102) 
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:619) 
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647) 
    at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:606) 
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) 
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) 
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) 
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) 
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
15/05/05 22:07:25 INFO metrics.MetricsSaver: Saved 3:3 records to /mnt/var/em/raw/i-e4bd8f2d_20150505_SparkSubmit_05979_raw.bin 
Command exiting with ret '1' 

Why are you using EMR? What is the advantage compared to plain EC2? There is an [official script](https://spark.apache.org/docs/1.3.1/ec2-scripts.html) for running Spark on EC2. Doesn't EMR complicate things and cost more? –


@DanielDarabos I switched to the 'spark-ec2' script that ships with Spark and I have not had any problems. –


@DanielDarabos There are actually a lot of differences, mainly cluster start-up time. If you set up a cluster of roughly 50+ machines with the ec2 script, it takes more than 45 minutes before they are ready to use; EMR does the job in less than half that time. Plus, EMR lets you automate batch Spark jobs very conveniently, which is painful to do with the spark-ec2 script, especially collecting logs when a task fails. – Sohaib

Answer


I think this may be an Apache Spark bug, although I don't see it reported in the Spark JIRA. However, http://apache-spark-user-list.1001560.n3.nabble.com/Resources-not-uploaded-when-submitting-job-in-yarn-client-mode-td21516.html seems to describe the same situation. According to that thread, the problem is that in your deployment setup Spark wrongly decides that the destination filesystem is the same as the client's, so it skips the copy:

15/05/05 22:07:25 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar

I suggest you try `--jars` instead of `--packages` (see Submitting Applications). If that works, please file a bug about this issue!
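A minimal sketch of that workaround: download the two resolved artifacts by hand and pass them with `--jars` so spark-submit uploads them to the HDFS staging directory itself instead of relying on Ivy resolution. The download URLs below are assumptions based on the standard Maven Central layout (versions match the resolution report above); adjust paths for your cluster.

```shell
#!/bin/sh
# Fetch spark-csv and its commons-csv dependency from Maven Central
# (assumed URLs, standard repository layout).
JARS_DIR="$HOME/jars"
mkdir -p "$JARS_DIR"
wget -q -P "$JARS_DIR" \
  https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.0.3/spark-csv_2.10-1.0.3.jar \
  https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar

# --jars takes a comma-separated list; these jars are shipped to the
# YARN staging dir by the client, bypassing the broken Ivy-path logic.
JAR_LIST="$JARS_DIR/spark-csv_2.10-1.0.3.jar,$JARS_DIR/commons-csv-1.1.jar"
spark/bin/spark-submit \
  --jars "$JAR_LIST" \
  --deploy-mode cluster --master yarn-cluster script.py
```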


What is the setting that resolved this issue? – nish1013


Not sure. Did '--jars' instead of '--packages' help? A recent post at https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%[email protected].com%3E suggests that perhaps you just need a 'core-site.xml' file. –
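For reference, the relevant part of a minimal client-side core-site.xml pointing at the cluster's HDFS might look like the sketch below (the NameNode address is taken from the logs above; everything else on your cluster's copy will differ):

```xml
<configuration>
  <!-- default filesystem the Spark client should resolve paths against -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.31.13.205:9000</value>
  </property>
</configuration>
```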


I already have a core-site.xml. I downloaded it for the YARN service via Ambari's 'Download Client Configs' option; that is the version I copied into my development machine's Hadoop configuration. – nish1013
