
Spark crashes while reading a JSON file when linked with aws-java-sdk

config.json is a small JSON file:

{ 
    "toto": 1 
} 

I wrote a simple piece of code that reads the JSON file with sc.textFile (textFile is convenient because the file can be on S3, local, or HDFS):

import org.apache.spark.{SparkContext, SparkConf}

object testAwsSdk {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test-aws-sdk").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val json = sc.textFile("config.json")
    println(json.collect().mkString("\n"))
  }
}

The sbt file pulls in only spark-core:

libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.5.1" % "compile" 
) 

The program works as expected, writing the contents of config.json to standard output.

Now I want to link with aws-java-sdk, Amazon's SDK for accessing S3:

libraryDependencies ++= Seq(
    "com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile", 
    "org.apache.spark" %% "spark-core" % "1.5.1" % "compile" 
) 

Running the same code, Spark throws the following exception:

Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope) 
at [Source: {"id":"0","name":"textFile"}; line: 1, column: 1] 
    at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:148) 
    at com.fasterxml.jackson.databind.DeserializationContext.mappingException(DeserializationContext.java:843) 
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.addBeanProps(BeanDeserializerFactory.java:533) 
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:220) 
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:143) 
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:409) 
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:358) 
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:265) 
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:245) 
    at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:143) 
    at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:439) 
    at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:3666) 
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3558) 
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2578) 
    at org.apache.spark.rdd.RDDOperationScope$.fromJson(RDDOperationScope.scala:82) 
    at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133) 
    at org.apache.spark.rdd.RDDOperationScope$$anonfun$5.apply(RDDOperationScope.scala:133) 
    at scala.Option.map(Option.scala:145) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:133) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) 
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:709) 
    at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1012) 
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:827) 
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:825) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) 
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:709) 
    at org.apache.spark.SparkContext.textFile(SparkContext.scala:825) 
    at testAwsSdk$.main(testAwsSdk.scala:11) 
    at testAwsSdk.main(testAwsSdk.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:497) 
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) 

Reading the stack trace, it seems that when aws-java-sdk is linked, sc.textFile detects that the file is a JSON file and tries to parse it with Jackson, assuming a certain format which of course it cannot find. I need to link with aws-java-sdk, so my questions are:

1. Why does adding aws-java-sdk modify the behavior of spark-core?

2. Is there a workaround (the file can be on HDFS, S3, or local)?


This is because aws-java-sdk is using the latest version of the jackson library, 2.5.3, while spark uses the older version 2.4.4. I am facing the same issue but could not resolve it. If you have found a solution, please share it. Thanks –


Hi Hafiz... pretty annoying, isn't it? I sent a case to AWS. They confirmed that it is a compatibility issue, but they did not give me a clear solution. I will try to sort it out as soon as possible. – Boris


Hi Boris! Yes, it is annoying to run into this, but I solved it by excluding the jackson core and jackson module libraries from spark-core and adding a dependency on the latest jackson core library –
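A minimal build.sbt sketch of the exclusion approach described in the comment above, assuming sbt 0.13; the Jackson version 2.5.3 and the exact exclusion rules are assumptions drawn from the comment, not a tested configuration:

// Sketch: drop Jackson from spark-core and pin a newer Jackson explicitly.
// The 2.5.3 version below is an assumption based on the comment above.
libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
  ("org.apache.spark" %% "spark-core" % "1.5.1" % "compile").excludeAll(
    ExclusionRule(organization = "com.fasterxml.jackson.core"),
    ExclusionRule(organization = "com.fasterxml.jackson.module")
  ),
  // Re-add the Jackson artifacts at a version compatible with the AWS SDK
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.5.3",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.5.3"
)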

Answers


From Amazon support: it is a dependency issue with the Jackson library. In sbt, override Jackson:

libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile",
  "org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)

dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)

Their answer: We did this on a Mac, an EC2 instance (RedHat AMI), and EMR (Amazon Linux), i.e. three different environments. The root of the issue is that sbt builds a dependency graph and then handles version conflicts by evicting the older version and picking the latest version of the dependent library. In this case, Spark depends on the 2.4 version of the Jackson library while the AWS SDK needs 2.5. So there is a version conflict, and sbt evicts Spark's dependency version (the older one) and picks the AWS SDK version (the latest one).
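If you want to confirm which Jackson version actually ends up on the classpath after the override, a small runtime check like the one below can help. This is my own sketch, not part of Amazon's answer; it only assumes that jackson-databind is a (possibly transitive) dependency of the project:

// Prints the version metadata bundled with the jackson-databind jar that
// sbt actually resolved, so the effect of dependencyOverrides can be verified.
import com.fasterxml.jackson.databind.cfg.PackageVersion

object jacksonVersionCheck {
  def main(args: Array[String]): Unit = {
    println(PackageVersion.VERSION)  // e.g. 2.4.4 once the override applies
  }
}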


Adding to Boris' answer: if you don't want to use a fixed version of Jackson (maybe you will upgrade Spark in the future) but still want to discard the one coming from AWS, you can do the following:

libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk" % "1.10.30" % "compile" excludeAll (
    ExclusionRule("com.fasterxml.jackson.core", "jackson-databind")
  ),
  "org.apache.spark" %% "spark-core" % "1.5.1" % "compile"
)
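Either way, sbt's evicted task (available in recent 0.13.x releases) lists the version conflicts and which artifact won, which makes it easy to check that jackson-databind is no longer being bumped by the AWS SDK.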