I am trying to submit a Spark job that declares the spark-csv package as a dependency: how do I submit an application to a YARN cluster so that the jars resolved for the package are also copied over?
spark/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 --deploy-mode cluster --master yarn-cluster script.py
but I get the following exception (snippet):
15/05/05 22:23:46 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://172.31.13.205:9000/home/hadoop/.ivy2/jars/spark-csv_2.10.jar
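A workaround that is sometimes suggested for this failure mode (the local Ivy cache path being handed to YARN verbatim) is to skip `--packages` and pass the already-resolved jars to `--jars` with explicit `file:` URIs, so that the client uploads them to the HDFS staging directory itself. A sketch, assuming the jars already sit in the Ivy retrieval directory shown in the log; the exact file names should be confirmed against that directory:

```shell
# Workaround sketch: pass the cached jars explicitly so that
# yarn.Client copies them into the HDFS staging directory.
# The spark-csv file name is taken from the log above; the
# commons-csv name is a guess -- check /home/hadoop/.ivy2/jars first.
JARS=/home/hadoop/.ivy2/jars
spark/bin/spark-submit \
  --jars file://$JARS/spark-csv_2.10.jar,file://$JARS/commons-csv.jar \
  --deploy-mode cluster \
  --master yarn-cluster \
  script.py
```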
The Spark cluster was installed and configured with the following script:
aws emr create-cluster --name sandbox --ami-version 3.6 --instance-type m3.xlarge --instance-count 3 \
--ec2-attributes KeyName=sandbox \
--applications Name=Hive \
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \
--log-uri s3://mybucket/spark-logs \
--steps \
Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server \
Name=SparkConfigure,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://support.elasticmapreduce/spark/configure-spark.bash,spark.default.parallelism=100,spark.locality.wait.rack=0]
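For what it is worth, one commonly reported mitigation on this Spark version is to use client deploy mode, where the driver runs on the node that invokes spark-submit and can read the resolved jars straight from the local Ivy cache; a sketch, assuming the job is launched from the master node:

```shell
# Client deploy mode sketch: the driver runs where spark-submit is
# invoked, so the jars resolved by --packages are available locally
# instead of needing to be staged to HDFS first.
spark/bin/spark-submit \
  --packages com.databricks:spark-csv_2.10:1.0.3 \
  --master yarn-client \
  script.py
```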
This should be broadly relevant to Spark developers, since I imagine using EMR with Spark is not an uncommon workflow, and I am not doing anything particularly complicated.
Here is the full stack trace:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/.versions/spark-1.3.0.d/lib/spark-assembly-1.3.0-hadoop2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central
:: resolution report :: resolve 238ms :: artifacts dl 8ms
:: modules in use:
com.databricks#spark-csv_2.10;1.0.3 from central in [default]
org.apache.commons#commons-csv;1.1 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/10ms)
15/05/05 22:07:23 INFO client.RMProxy: Connecting to ResourceManager at /172.31.13.205:9022
15/05/05 22:07:23 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
15/05/05 22:07:23 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
15/05/05 22:07:23 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/05/05 22:07:23 INFO yarn.Client: Setting up container launch context for our AM
15/05/05 22:07:23 INFO yarn.Client: Preparing resources for our AM container
15/05/05 22:07:24 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.0.d/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://172.31.13.205:9000/user/hadoop/.sparkStaging/application_1430862769169_0005/spark-assembly-1.3.0-hadoop2.4.0.jar
15/05/05 22:07:24 INFO metrics.MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false
15/05/05 22:07:24 INFO metrics.MetricsSaver: Created MetricsSaver j-3C91V87M8TXWD:i-e4bd8f2d:SparkSubmit:05979 period:60 /mnt/var/em/raw/i-e4bd8f2d_20150505_SparkSubmit_05979_raw.bin
15/05/05 22:07:25 INFO yarn.Client: Source and destination file systems are the same. Not copying /home/hadoop/.ivy2/jars/spark-csv_2.10.jar
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://172.31.13.205:9000/home/hadoop/.ivy2/jars/spark-csv_2.10.jar
at org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:129)
at org.apache.hadoop.fs.AbstractFileSystem.resolvePath(AbstractFileSystem.java:460)
at org.apache.hadoop.fs.FileContext$23.next(FileContext.java:2120)
at org.apache.hadoop.fs.FileContext$23.next(FileContext.java:2116)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.resolve(FileContext.java:2116)
at org.apache.hadoop.fs.FileContext.resolvePath(FileContext.java:591)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:203)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4$$anonfun$apply$1.apply(Client.scala:285)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4$$anonfun$apply$1.apply(Client.scala:280)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4.apply(Client.scala:280)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$4.apply(Client.scala:278)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:278)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:384)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:102)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:619)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/05/05 22:07:25 INFO metrics.MetricsSaver: Saved 3:3 records to /mnt/var/em/raw/i-e4bd8f2d_20150505_SparkSubmit_05979_raw.bin
Command exiting with ret '1'
Why are you using EMR? What is the advantage over plain EC2? There are [official scripts](https://spark.apache.org/docs/1.3.1/ec2-scripts.html) for running Spark on EC2. Doesn't EMR just complicate things and cost more? –
@DanielDarabos I switched to the 'spark-ec2' script that ships with Spark and I haven't had any problems. –
@DanielDarabos There are actually quite a few differences, the main one being cluster spin-up time. If you use the ec2 scripts to set up a cluster of 50+ machines, it takes over 45 minutes before they are ready to use; EMR does the job in less than half that time. EMR also lets you automate batch Spark jobs very conveniently, which is painful with the spark-ec2 scripts, especially collecting logs when a task fails. – Sohaib
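The batch automation mentioned in the last comment can be sketched with the EMR CLI, adding a step to a running cluster through the same script-runner mechanism used for the bootstrap steps above; the cluster id, step name, and script path here are placeholders:

```shell
# Hypothetical example: automate a batch Spark job on a running EMR
# cluster by adding a step via script-runner. j-XXXXXXXXXXXXX and the
# script path are placeholders for real values.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Name=SparkJob,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/home/hadoop/spark/bin/spark-submit,--master,yarn-cluster,/home/hadoop/script.py]
```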