
I want to run Spark locally and read a local Windows file in Apache Spark. My environment is:

  1. Eclipse Luna with prebuilt Scala support.
  2. A project created in Eclipse, converted to Maven, with the Spark Core dependency jar added.
  3. winutils.exe downloaded and the HADOOP_HOME path set (see the sketch after this list).
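As a side note, a minimal sketch of setting the same thing from code, in case Eclipse does not pick up the environment variable; the C:\hadoop location is an assumption and must contain bin\winutils.exe:

// Assumed location; equivalent to setting the HADOOP_HOME environment variable.
// Call this at the top of main, before the SparkContext is created.
System.setProperty("hadoop.home.dir", "C:\\hadoop")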

The code I am trying to run is:

import org.apache.spark.{SparkConf, SparkContext}

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
    /* val master = args.length match {
         case x: Int if x > 0 => args(0)
         case _ => "local"
       } */
    /* val sc = new SparkContext(master, "BasicMap", System.getenv("SPARK_HOME")) */
    val conf = new SparkConf().setAppName("HelloWorld").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val input = sc.textFile("C://Users//user name//Downloads//error.txt")
    // Split it up into words.
    val words = input.flatMap(line => line.split(" "))
    // Transform into pairs and count.
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.foreach(println)
  }
}

But when I read the file with the SparkContext, I get the following error:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Downloads/error.txt 
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) 
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) 
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) 
at scala.Option.getOrElse(Option.scala:120) 
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) 
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65) 
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290) 
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) 
at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289) 
at com.examples.HelloWorld$.main(HelloWorld.scala:23) 
at com.examples.HelloWorld.main(HelloWorld.scala) 

Can someone give me some insight into how to overcome this error?

Do you have cygwin on your path? – abalcerek

@user52045 No, I don't have cygwin. – Satya

I'm pretty sure you need it. – abalcerek

Answers


The problem was that the user name contained a space, and that was causing all the trouble. Once I moved the file to a path without spaces, it worked fine.
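A minimal sketch of the fixed read under that assumption; C:/tmp/error.txt is only an illustrative space-free location, and the explicit file:/// scheme makes the local-filesystem intent unambiguous (the lines below would replace the body of main, or run as-is in spark-shell):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HelloWorld").setMaster("local[2]"))
// A path without spaces; the file:/// prefix forces the local filesystem.
val input = sc.textFile("file:///C:/tmp/error.txt")
input.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).foreach(println)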


It worked for me on W10 with Spark 2, using SparkSession.builder() with .config("spark.sql.warehouse.dir", "file:///")

and using \ in the path ...

PS: be sure to give the file with its full extension.
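A minimal sketch of that approach, assuming Spark 2.x is on the classpath; the warehouse setting is the one mentioned above, while C:\tmp\error.txt is only an illustrative location (runnable as-is in spark-shell):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HelloWorld")
  .master("local[2]")
  .config("spark.sql.warehouse.dir", "file:///")
  .getOrCreate()
// Backslashes must be escaped in a Scala string literal; give the full file name with its extension.
val counts = spark.sparkContext.textFile("C:\\tmp\\error.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.foreach(println)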

[local] [file] [spark2]