
I am getting an error when trying to read a file from HDFS into Spark. The file README.md exists in HDFS, yet Spark/Hadoop raises org.apache.hadoop.mapred.InvalidInputException: Input path does not exist.

[spark@localhost hadoop]$ hdfs dfs -ls README.md 
16/02/26 00:29:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
-rw-r--r-- 1 spark supergroup  4811 2016-02-25 23:38 README.md 

From the spark-shell, I run:

scala> val readme = sc.textFile("hdfs://localhost:9000/README.md") 
readme: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:27 

scala> readme.count 
16/02/26 00:25:26 DEBUG BlockManager: Getting local block broadcast_4 
16/02/26 00:25:26 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(true, true, false, true, 1) 
16/02/26 00:25:26 DEBUG BlockManager: Getting block broadcast_4 from memory 
16/02/26 00:25:26 DEBUG HadoopRDD: Creating new JobConf and caching it for later re-use 
16/02/26 00:25:26 DEBUG Client: The ping interval is 60000 ms. 
16/02/26 00:25:26 DEBUG Client: Connecting to localhost/127.0.0.1:9000 
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: starting, having connections 1 
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark sending #4 
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark got value #4 
16/02/26 00:25:26 DEBUG ProtobufRpcEngine: Call: getFileInfo took 6ms 
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/README.md 
     at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) 
     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) 
     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
     at scala.Option.getOrElse(Option.scala:120) 
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
     at scala.Option.getOrElse(Option.scala:120) 
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) 
     at org.apache.spark.rdd.RDD.count(RDD.scala:1143) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35) 
     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37) 
     at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39) 
     at $iwC$$iwC$$iwC$$iwC.<init>(<console>:41) 
     at $iwC$$iwC$$iwC.<init>(<console>:43) 
     at $iwC$$iwC.<init>(<console>:45) 
     at $iwC.<init>(<console>:47) 
     at <init>(<console>:49) 
     at .<init>(<console>:53) 
     at .<clinit>(<console>) 
     at .<init>(<console>:7) 
     at .<clinit>(<console>) 
     at $print(<console>) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:606) 
     at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) 
     at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) 
     at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) 
     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) 
     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) 
     at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) 
     at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) 
     at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) 
     at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) 
     at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) 
     at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) 
     at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) 
     at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) 
     at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) 
     at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) 
     at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) 
     at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) 
     at org.apache.spark.repl.Main$.main(Main.scala:31) 
     at org.apache.spark.repl.Main.main(Main.scala) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:606) 
     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) 
     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) 
     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) 
     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) 
     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 


scala> 16/02/26 00:25:36 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: closed 
16/02/26 00:25:36 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: stopped, remaining connections 0 

In core-site.xml I have the following entry:

<configuration> 
<property> 
    <name>fs.defaultFS</name> 
    <value>hdfs://localhost:9000</value> 
</property> 
</configuration> 

and hdfs-site.xml has the following:

<configuration> 
<property> 
    <name>dfs.replication</name> 
    <value>1</value> 
</property> 
</configuration> 

Am I missing something here? My OS is CentOS Linux release 7.2.1511 (Core), Hadoop is 2.7.2, and Spark is spark-1.6.0-bin-hadoop2.6.


After adding user/spark to the URI, I can read README.md from HDFS into Spark successfully: `scala> val readme = sc.textFile("hdfs://localhost:9000/user/spark/README.md")` `readme: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27` `scala> readme.count` `res1: Long = 141` – Raxbangalore

Answers


You could try changing your command as follows and running it again:

val readme = sc.textFile("./README.md") 

`scala> val readme = sc.textFile("./README.md")` `readme: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27` `scala> readme.count` `org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/spark/README.md at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) ....` – Raxbangalore


By default, hdfs dfs -ls lists your user's home folder in HDFS, not the HDFS root. You can easily verify this by comparing the output of hdfs dfs -ls and hdfs dfs -ls /. When you use the full hdfs URL you are using an absolute path, and your file is not found there (because it lives in the user's home folder). When you use a relative path, the problem goes away :)

You may also want to know that hdfs dfs -put uses the HDFS home folder as the default destination for files as well, not the HDFS root.
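A minimal sketch of both spellings from the spark-shell, assuming the file sits in the user's HDFS home directory /user/spark (which the comment on the question confirms for this setup); the relative form only resolves against HDFS if the shell actually picks up fs.defaultFS from core-site.xml:

// Fully qualified path: scheme, host:port and the user's home directory
// are spelled out, so there is no ambiguity about where the file lives.
val readmeAbsolute = sc.textFile("hdfs://localhost:9000/user/spark/README.md")
readmeAbsolute.count

// Relative path: resolved against the current user's HDFS home directory
// (e.g. /user/spark), but only when fs.defaultFS is visible to the shell.
val readmeRelative = sc.textFile("README.md")
readmeRelative.count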


This happens because of the internal mapping between directories. First go to the directory where your file (README.md) is kept and run the command df -k .; this gives you the actual mount point of that directory, for example /xyz. Now look for your file (README.md) under that mount point, for example /xyz/home/omi/myDir/README.md, and use this path in your code: val readme = sc.textFile("/xyz/home/omi/myDir/README.md");
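As a side note, if the goal is to read a local copy of the file (as this answer suggests), an explicit file:// scheme makes it unambiguous that the local filesystem is meant, regardless of what fs.defaultFS is set to. A minimal sketch, reusing this answer's hypothetical path:

// The file:// scheme selects the local filesystem explicitly.
// Replace the path below with the real local location of README.md.
val localReadme = sc.textFile("file:///xyz/home/omi/myDir/README.md")
localReadme.count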


Thanks! It worked for me. –
