2017-08-30 104 views
0

Why does Spark throw an ArrayIndexOutOfBoundsException for an empty attribute?

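I am using Spark 1.5.

I have a file records.txt that is Ctrl-A delimited, and the field at index 31 is subscriber_id. For some records subscriber_id is empty.

Here is a record with a non-empty subscriber_id; the value (UK8jikahasjp23) sits just before the last attribute: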

99^A2013-12-11^A23421421412^Aqweqweqw2222^A34232432432^A365633049^A1^A6yudgfdhaf9923^AAC^APrimary DTV^AKKKR DATA+ PVR3^AGrundig^AKKKR PVR3^AKKKR DATA+ PVR3^A127b146^APVR3^AYes^ANo^ANo^ANo^AYes^AYes^ANo^A2017-08-07 21:27:30.000000^AYes^ANo^ANo^A6yudgfdhaf9923^A7290921396551747605^A2013-12-11 16:00:03.000000^A7022497306379992936^AUK8jikahasjp23^A

A record with an empty subscriber_id:

23^A2013-12-11^A23421421412^Aqweqweqw2222^A34232432432^A365633049^A1^A6yudgfdhaf9923^AAC^APrimary DTV^AKKKR DATA+ PVR3^AGrundig^AKKKR PVR3^AKKKR DATA+ PVR3^A127b146^APVR3^AYes^ANo^ANo^ANo^AYes^AYes^ANo^A2017-08-07 21:27:30.000000^AYes^ANo^ANo^A6yudgfdhaf9923^A7290921396551747605^A2013-12-11 16:00:03.000000^A7022497306379992936^A^A

Problem

I get a java.lang.ArrayIndexOutOfBoundsException for records that have an empty subscriber_id.

Why does Spark throw java.lang.ArrayIndexOutOfBoundsException when the subscriber_id field is empty?

16/08/20 10:22:18 WARN scheduler.TaskSetManager: Lost task 31.0 in stage 8.0: java.lang.ArrayIndexOutOfBoundsException: 31

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
// Assumption: the logger is SLF4J; the question does not show its imports
import org.slf4j.LoggerFactory

case class CustomerCard(accountNumber: String, subscriber_id: String, subscriptionStatus: String)

object CustomerCardProcess {
  val log = LoggerFactory.getLogger(this.getClass.getName)

  def doPerform(sc: SparkContext, sqlContext: HiveContext, custCardRDD: RDD[String]): DataFrame = {
    import sqlContext.implicits._
    log.info("doCustomerCardProcess method started")

    // Split each line on Ctrl-A (U+0001)
    val splitRDD = custCardRDD.map(elem => elem.split("\\u0001"))
    // Build a CustomerCard from the fields at indices 3, 31 and 8
    val schemaRDD = splitRDD.map(arr => new CustomerCard(arr(3).trim, arr(31).trim, arr(8).trim))

    schemaRDD.toDF().registerTempTable("customer_card")
    val custCardDF = sqlContext.sql(
      """
        |SELECT
        |accountNumber,
        |subscriber_id
        |FROM
        |customer_card
        |WHERE
        |subscriptionStatus IN('AB', 'AC', 'PC')
        |AND accountNumber IS NOT NULL AND LENGTH(accountNumber) > 0
      """.stripMargin)

    log.info("doCustomerCardProcess method ended")
    custCardDF
  }
}

Error

13/09/12 23:22:18 WARN scheduler.TaskSetManager: Lost task 31.0 in stage 8.0 (TID 595, : java.lang.ArrayIndexOutOfBoundsException: 31
    at com.org.CustomerCardProcess$$anonfun$2.apply(CustomerCardProcess.scala:23)
    at com.org.CustomerCardProcess$$anonfun$2.apply(CustomerCardProcess.scala:23)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Can anyone help me fix this?

Answer

3

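The split function drops all empty fields at the end of the line being split. So change this line of yours

val splitRDD = custCardRDD.map(elem => elem.split("\\u0001"))

to

val splitRDD = custCardRDD.map(elem => elem.split("\\u0001", -1))

The -1 limit tells split to keep all empty fields, including the trailing ones.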
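To see the effect without a Spark cluster, here is a minimal plain-Scala sketch. The record is a hypothetical stand-in (field names f0 to f30 are made up), shaped like the failing record in the question: 31 non-empty fields followed by an empty subscriber_id and an empty last field.

```scala
object SplitLimitDemo {
  // Hypothetical stand-in for a record from the question: 31 non-empty
  // fields (indices 0 to 30), then an empty subscriber_id and an empty last field.
  val line: String =
    (0 to 30).map(i => s"f$i").mkString("\u0001") + "\u0001\u0001"

  // Default limit (0): trailing empty strings are removed from the result.
  val dropped: Array[String] = line.split("\\u0001")

  // Negative limit: every field is kept, including trailing empty ones.
  val kept: Array[String] = line.split("\\u0001", -1)

  def main(args: Array[String]): Unit = {
    println(dropped.length)   // 31, so dropped(31) is out of bounds
    println(kept.length)      // 33
    println(kept(31).isEmpty) // true: subscriber_id is present but empty
  }
}
```

With the default limit the array ends at index 30, which matches the "ArrayIndexOutOfBoundsException: 31" in the log.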

+0

Great answer. Does it work for other delimiters such as a comma or a pipe, or only for Ctrl-A?

+0

It works for any delimiter you pass to the split function; the limit is simply a second argument to split. :) Thanks for accepting and upvoting.
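A small sketch of the same idea with other delimiters, using made-up sample strings. One caveat worth keeping in mind: the first argument of split is a regular expression, so a pipe has to be escaped or quoted, because an unescaped | is the regex alternation operator and would not split on the literal character.

```scala
import java.util.regex.Pattern

object DelimiterDemo {
  // Comma needs no escaping in a regex.
  val commaFields: Array[String] = "a,b,,".split(",", -1)

  // Pipe is a regex metacharacter, so escape it...
  val pipeEscaped: Array[String] = "a|b||".split("\\|", -1)

  // ...or let Pattern.quote build a literal pattern for it.
  val pipeQuoted: Array[String] = "a|b||".split(Pattern.quote("|"), -1)

  def main(args: Array[String]): Unit = {
    // Each array has 4 elements: "a", "b" and two empty trailing fields.
    println(commaFields.length)
    println(pipeEscaped.length)
    println(pipeQuoted.length)
  }
}
```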

+0

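Hi Ramesh, I see that the split function ignores empty fields only when they are at the end of the line being split. If there is an empty field at position 3 or 8, split works fine. Is my understanding correct?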
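That reading matches the documented behaviour of Java's String.split, which Scala strings use: with the default limit only trailing empty strings are removed, while empty fields in the middle are kept. A short sketch with a made-up line that has one interior and two trailing empty fields:

```scala
object TrailingVsInteriorDemo {
  // Fields: "a", "" (interior), "c", "" and "" (both trailing).
  val line: String = "a\u0001\u0001c\u0001\u0001"

  // Default limit: the interior empty field survives, the trailing ones do not.
  val defaultSplit: Array[String] = line.split("\\u0001")

  // Limit -1: all five fields are kept.
  val fullSplit: Array[String] = line.split("\\u0001", -1)

  def main(args: Array[String]): Unit = {
    println(defaultSplit.length)     // 3
    println(defaultSplit(1).isEmpty) // true: the interior empty field is still there
    println(fullSplit.length)        // 5
  }
}
```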
