
I have a Spark DataFrame built with the code below, but I cannot write it out or apply groupBy to it.

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
scala> import sqlContext.implicits._ 

scala> case class Wiki(project: String, title: String, count: Int, byte_size: String) 

scala> val data = sc.textFile("s3n://+++/").map(_.split(" ")).map(p => Wiki(p(0), p(1), p(2).trim.toInt, p(3))) 

scala> val df = data.toDF() 

When I try to write it to an output file:

scala> df.write.parquet("df.parquet") 

or aggregate the data with:

scala> df.filter("project = 'en'").select("title","count").groupBy("title").sum().collect() 

it fails with an error similar to the following:

WARN TaskSetManager: Lost task 855.0 in stage 0.0 (TID 855, ip-172-31-10-195.ap-northeast-1.compute.internal): org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251) 
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) 
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
at org.apache.spark.scheduler.Task.run(Task.scala:88) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2 
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:28) 
at $line24.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:28) 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:242) 
... 8 more 

The schema of my DataFrame looks like this:

root 
 |-- project: string (nullable = true) 
 |-- title: string (nullable = true) 
 |-- count: integer (nullable = false) 
 |-- byte_size: string (nullable = true) 

How should I interpret this error, and how can I fix it?


Is your Hadoop cluster working properly? – Reactormonk


@Reactormonk I think so. I launched a Spark cluster on AWS EMR and everything seems fine. I'm working interactively in spark-shell. –


Could you add your DataFrame schema? You have two different kinds of errors here! – eliasah

Answer


Make sure your split always returns an array of 4 fields. You may have some malformed entries, or you may be splitting on the wrong character.

Try filtering them out:

val data = sc.textFile("s3n://+++/").map(_.split(" ")).filter(_.size == 4).map(p => Wiki(p(0), p(1), p(2).trim.toInt, p(3))) 

and see whether the error persists. An ArrayIndexOutOfBoundsException right after a split usually means that some records were parsed incorrectly. In your case, the index 2 most likely means that p(2) could not be read, i.e. at least one record contained only two values, p(0) and p(1).
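
As a quick diagnostic, you could also count and inspect the offending lines directly in spark-shell. This is only a sketch, assuming the same space-delimited input as in the question; the path placeholder is copied from there, and the names raw/bad are illustrative:

// Sketch: count and sample the lines that do not split into exactly 4 fields 
val raw = sc.textFile("s3n://+++/") 
val bad = raw.map(_.split(" ")).filter(_.length != 4) 
println("Malformed records: " + bad.count()) 
bad.take(5).foreach(fields => println(fields.mkString("|"))) 

If the count is non-zero, looking at a few of those samples should tell you whether the problem is a different delimiter (for example tabs or multiple spaces) or genuinely incomplete rows.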