1
我具有如JSON輸入文件:scala.MatchError:[ABC空,CDE,3](的類org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)在火花JSON與丟失的字段
{"a": "abc", "b": "bcd", "d": 3},
{"a": "ezx", "b": "hdg", "c": "ssa"},
...
每個對象的某些字段丟失,而不是放置null
值。使用Scala的
在Apache中星火:
import SparkCommons.sparkSession.implicits._
private val inputJsonPath: String = "resources/input/input.json"
private val schema = StructType(Array(
StructField("a", StringType, nullable = false),
StructField("b", StringType, nullable = false),
StructField("c", StringType, nullable = true),
StructField("d", DoubleType, nullable = true)
))
private val inputDF: DataFrame = SparkCommons.sparkSession
.read
.schema(schema)
.json(inputJsonPath)
.cache()
inputDF.printSchema()
val dataRdd = inputDF.rdd
.map {
case Row(a: String, b: String, c: String, d: Double) =>
MyCaseClass(a, b, c, d)
}
val dataMap = dataRdd.collectAsMap()
的MyCaseClass
代碼:
case class MyCaseClass(
a: String,
b: String,
c: String = null,
d: Double = Predef.Double2double(null)
)
我得到下面的模式作爲輸出:
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- c: string (nullable = true)
|-- d: double (nullable = true)
程序編譯,但在運行時,一旦星火正在做的工作,我得到以下例外:
[error] - org.apache.spark.executor.Executor - Exception in task 3.0 in stage 4.0 (TID 21)
scala.MatchError: [abc,bcd,null,3] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at com.matteoguarnerio.spark.SparkOperations$$anonfun$1.apply(SparkOperations.scala:62) ~[classes/:na]
at com.matteoguarnerio.spark.SparkOperations$$anonfun$1.apply(SparkOperations.scala:62) ~[classes/:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) ~[scala-library-2.11.11.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) ~[scala-library-2.11.11.jar:na]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) ~[scala-library-2.11.11.jar:na]
at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:42) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.RangePartitioner$$anonfun$9.apply(Partitioner.scala:261) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.RangePartitioner$$anonfun$9.apply(Partitioner.scala:259) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.scheduler.Task.run(Task.scala:86) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) ~[spark-core_2.11-2.0.2.jar:2.0.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_144]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_144]
星火版本:2.0.2
斯卡拉版本:2.11.11
- 如何解決這個異常,並遍歷即使某些字段
null
或丟失的RDD匹配和創建對象? - 爲什麼模式,即使我明確地定義了在某些字段上不可爲空並且爲空也是可以爲空的?
UPDATE
我只是用一種變通方法上dataRdd
以避免該問題:
private val dataRdd = inputDF.rdd
.map {
case r: GenericRowWithSchema => {
val a = r.getAs("a").asInstanceOf[String]
val b = r.getAs("b").asInstanceOf[String]
var c: Option[String] = None
var d: Option[Double] = None
try {
c = if (r.isNullAt(r.fieldIndex("c"))) None: Option[String] else Some(r.getAs("c").asInstanceOf[String])
d = if (r.isNullAt(r.fieldIndex("d"))) None: Option[Double] else Some(r.getAs("d").asInstanceOf[Double])
} catch {
case _: Throwable => None
}
MyCaseClass(a, b, c, d)
}
}
,並以這種方式改變MyCaseClass
:
case class MyCaseClass(
a: String,
b: String,
c: Option[String],
d: Option[Double]
)
JSON也被引用爲屬性鍵。我犯了錯誤的文件,它已經在這種形式,問題仍然存在。 –
就像我在我的回答中提到的 - 你的代碼工作正常!沒有問題,除非你的問題中還有別的東西沒有。 – himanshuIIITian