
Spark: exception reading an S3 file with Spark 1.5.2 pre-built for hadoop-2.6

I am trying to read an existing file from a Spark-based application. Here is my code snippet:

sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "MYKEY") 
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "MYSECRET") 

val a = sc.textFile("s3://myBucket/TNRealtime/output/2016/01/27/22/45/00/a.txt").map{line => line.split(",")} 
val b = a.collect // **ERROR** producing statement 

I get this exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://snapdeal-personalization-dev-us-west-2/TNRealtime/output/2016/01/27/22/45/00/a.txt 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) 
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921) 
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310) 
    at org.apache.spark.rdd.RDD.collect(RDD.scala:908) 
    at com.snapdeal.pears.trending.TrendingDecay$.load(TrendingDecay.scala:68) 

Strangely, when I try the same snippet from spark-shell, I get a different error:

java.io.IOException: No FileSystem for scheme: s3 
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) 
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) 
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256) 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) 
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921) 

Can anyone help me understand the problem?

+0

Did you ever resolve this? – Greg

+0

@Greg: I don't remember exactly; it has been a while. – Mohitt

Answers

1

I don't know what your situation is, but when I run Spark locally and want to access S3 files, I put the key and secret inside the S3 path itself, like this:

sc.textFile("s3://MYKEY:[email protected]/TNRealtime/output/2016/01/27/22/45/00/a.txt") 

Maybe that will work for you too.

+1

I tried that. I am still getting: java.io.IOException: No FileSystem for scheme: s3 – Mohitt

1

Try replacing s3 with s3n, which is a newer protocol.
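
Applied to the snippet from the question, that would look roughly like the sketch below. This is a minimal sketch, assuming the same placeholder credentials and bucket path as in the question, and assuming the s3n filesystem classes are on the classpath of your Spark build; the fs.s3n.* property names are the ones the Hadoop native S3 filesystem reads.

// credentials for the s3n:// scheme use the fs.s3n.* property names,
// not fs.s3.* as in the question
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "MYKEY") 
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "MYSECRET") 

// same file as before, read through the s3n:// scheme
val a = sc.textFile("s3n://myBucket/TNRealtime/output/2016/01/27/22/45/00/a.txt").map{line => line.split(",")} 
val b = a.collect 

Note that the credential property names change along with the scheme: setting only fs.s3.* while reading an s3n:// path would leave the credentials unset for that filesystem.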