IndexOutOfBounds error in Zeppelin
I have run into a problem in Zeppelin: whenever I try to run a SQL operation against the temp table (a DataFrame) I created, I get an IndexOutOfBounds error.
Here is my code:
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
import org.apache.spark.sql.SparkSession
//import sqlContext._
val realdata = sc.textFile("/root/application.txt")
case class testClass(
  date: String,
  time: String,
  level: String,
  unknown1: String,
  unknownConsumer: String,
  unknownConsumer2: String,
  vloer: String,
  tegel: String,
  msg: String,
  sensor1: String,
  sensor2: String,
  sensor3: String,
  sensor4: String,
  sensor5: String,
  sensor6: String,
  sensor7: String,
  sensor8: String,
  batchsize: String,
  troepje1: String,
  troepje2: String)
val mapData = realdata
  .filter(line => line.contains("data") && line.contains("INFO"))
  .map(s => s.split(" ").toList)
  .map(s =>
    testClass(
      s(0),
      s(1).split(",")(0),
      s(1).split(",")(1),
      s(3),
      s(4),
      s(5),
      s(6),
      s(7),
      s(8),
      s(15),
      s(16),
      s(17),
      s(18),
      s(19),
      s(20),
      s(21),
      s(22),
      "",
      "",
      ""
    )
  ).toDF
//mapData.count()
//mapData.printSchema()
mapData.registerTempTable("temp_carefloor")
Then in a later note I try something simple like:
%sql
select * from temp_carefloor limit 10
I get the following error:
java.lang.IndexOutOfBoundsException: 18
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
at scala.collection.immutable.List.apply(List.scala:84)
at $line128330188484.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$3.apply(<console>:84)
at $line128330188484.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$3.apply(<console>:72)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:748)
Now I am sure it has something to do with the way my data comes out, but I just cannot figure out what I am doing wrong, and I am really banging my head against the wall here. I really hope someone can help me.
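Since Spark transformations are lazy, the map above only actually runs when the %sql query forces evaluation, which is why the exception only surfaces in the SQL paragraph. A minimal diagnostic sketch (not part of the original notebook; it assumes the same file path and filter as above) that prints the distinct token counts per matching line, so any line too short for s(22) shows up immediately:
// Diagnostic sketch: print the distinct token counts produced by
// split(" ") for the filtered lines; any count below 23 means an
// access like s(22) will throw IndexOutOfBoundsException.
val tokenCounts = sc.textFile("/root/application.txt")
  .filter(line => line.contains("data") && line.contains("INFO"))
  .map(line => line.split(" ").length)
tokenCounts.distinct.collect.sorted.foreach(println)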
Edit: here is an excerpt of the useful data I am trying to extract.
2016-03-10 07:18:58,985 INFO [http-nio-8080-exec-1] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, true, false, false, false]
2016-03-10 07:18:58,992 INFO [http-nio-8080-exec-7] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, false, false, false, false]
2016-03-10 07:18:59,907 INFO [http-nio-8080-exec-4] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, false, false, false, false]
2016-03-10 07:19:10,418 INFO [http-nio-8080-exec-9] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [true, true, false, false, false, false, false, false]
You can see the full flat file here: http://upload.grecom.nl/uploads/jeffrey/application.txt
There is definitely a problem with your data. Could you provide a sample so we can take a look? –
I edited my question so you can see the data and the full flat file. Thanks for that. – Jdeboer
The first thing I noticed is that when you split your lines on ' ', you get tokens like 'tile:' and '=' as fields of their own, because they are surrounded by spaces. I think that could be a problem for you? –
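To make that comment concrete, here is a small sketch using the first sample line from the question; it prints every token with its index, showing how 'tile:' and '=' become fields of their own and how the data array spreads across several tokens:
// Sketch illustrating the comment above, using the first sample log line.
val sample = "2016-03-10 07:18:58,985 INFO [http-nio-8080-exec-1] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor 12FR received update from tile: 12G0, data = [false, false, false, false, true, false, false, false]"
sample.split(" ").zipWithIndex.foreach { case (tok, i) => println(s"$i -> $tok") }
// Token 11 is the literal "tile:", token 14 is "=", and the data
// array occupies tokens 15 through 22 ("[false," ... "false]").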