
DecimalType problem - scala.MatchError of class java.lang.String

I am using Spark 1.6.1 built with Scala 2.10.5. I am examining some weather data, and some of the values are decimals. Here is the code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext

// split each line of the CSV into an Array[String]
val rawData = sc.textFile("Example_Weather.csv").map(_.split(","))

val header = rawData.first

// drop the header row by filtering out lines whose first field matches it
val rawDataNoHeader = rawData.filter(_(0) != header(0))

rawDataNoHeader.first

object schema {
  val weatherdata = StructType(Seq(
    StructField("date", StringType, true),
    StructField("Region", StringType, true),
    StructField("Temperature", DecimalType(32, 16), true),
    StructField("Solar", IntegerType, true),
    StructField("Rainfall", DecimalType(32, 16), true),
    StructField("WindSpeed", DecimalType(32, 16), true))
  )
}

val dataDF = sqlContext.createDataFrame(rawDataNoHeader.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5))), schema.weatherdata)

dataDF.registerTempTable("weatherdataSQL")

val datasql = sqlContext.sql("SELECT * FROM weatherdataSQL")

datasql.collect().foreach(println)

When I run the code, the schema and the DataFrame are created as expected:

scala> object schema { 
| val weatherdata= StructType(Seq(
| StructField("date", StringType, true), 
| StructField("Region", StringType, true), 
| StructField("Temperature", DecimalType(32,16), true), 
| StructField("Solar", IntegerType, true), 
| StructField("Rainfall", DecimalType(32,16), true), 
| StructField("WindSpeed", DecimalType(32,16), true)) 
|) 
| } 
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:56288 in memory (size: 4.6 KB, free: 511.1 MB) 
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:39349 in memory (size: 4.6 KB, free: 2.7 GB) 
16/09/24 09:40:58 INFO ContextCleaner: Cleaned accumulator 2 
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost in memory (size: 1964.0 B, free: 511.1 MB) 
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:41412 in memory (size: 1964.0 B, free: 2.7 GB) 
16/09/24 09:40:58 INFO ContextCleaner: Cleaned accumulator 1 
defined module schema 

scala> val dataDF=sqlContext.createDataFrame(rawDataNoHeader.map(p=>Row(p(0),p(1),p(2),p(3),p(4),p(5))), schema.weatherdata) 
dataDF: org.apache.spark.sql.DataFrame = [date: string, Region: string, Temperature: decimal(32,16), Solar: int, Rainfall: decimal(32,16), WindSpeed: decimal(32,16)] 

However, the last line of code gives me the following:

16/09/24 09:41:03 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): scala.MatchError: 20.21666667 (of class java.lang.String) 

The number 20.21666667 is indeed the first observed temperature for a particular geographic region. I thought I had successfully specified Temperature as DecimalType(32,16). Is there a problem with my code, or even with the sqlContext I am calling?

As suggested, I changed dataDF to the following:

val dataDF= sqlContext.createDataFrame(rawDataNoHeader.map(p=>Row(p(0),p(1),BigDecimal(p(2)),p(3),BigDecimal(p(4)),BigDecimal(p(5)))), schema.weatherdata) 

Unfortunately, I now get a casting problem:

16/09/24 10:31:35 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer 

Answers

Answer (0 votes)

This is probably because you are reading the data from a .csv file, which by default gives you everything in text/string format. You can solve this in one of two ways:

1. Change the data type of the Temperature attribute in the .csv file itself.
2. Convert the string in code, e.g. val temperatureInDecimal = BigDecimal("20.21666667")

If you want to future-proof your application, where the .csv file can change, I would recommend the second approach.
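As a sketch of the type mismatch (my illustration of the Spark 1.6 type mapping, assumed rather than taken from the answer): DecimalType columns expect a BigDecimal at runtime and IntegerType expects an Int, while split(",") yields only Strings:

// What the schema promises vs. what each Row actually holds
// (the date and region values below are hypothetical):
// StringType  -> String     : p(0), p(1) are fine as-is
// DecimalType -> BigDecimal : p(2), p(4), p(5) must be converted
// IntegerType -> Int        : p(3) must be converted
val ok = Row("2016-01-01", "North", BigDecimal("20.21666667"), 100, BigDecimal("0.0"), BigDecimal("3.2"))
val bad = Row("2016-01-01", "North", "20.21666667", "100", "0.0", "3.2") // all Strings: fails at runtime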

Comment: Why say "probably"? Give a confident answer; if the information in the question is not enough, ask for more details. That way your answer will be more useful. – pamu

Comment: Thanks for your help... however, a casting problem now appears; please see the edit. Thanks again! –

Answer (1 vote)

The code in your first edit is almost correct: p(3) must be converted with toInt.

I created a sample CSV file without a header:

2016,a,201.222,12,12.1,5.0 
2016,b,200.222,13,12.3,6.0 
2014,b,200.111,14,12.3,7.0 

The corrected conversion and its result:

val dataDF = sqlContext.createDataFrame(rawData.map(p => Row(p(0), p(1), BigDecimal(p(2)), p(3).toInt, BigDecimal(p(4)), BigDecimal(p(5)))), schema.weatherdata)

dataDF.show 
+----+------+--------------------+-----+-------------------+------------------+ 
|date|Region|   Temperature|Solar|   Rainfall|   WindSpeed| 
+----+------+--------------------+-----+-------------------+------------------+ 
|2016|  a|201.2220000000000000| 12|12.1000000000000000|5.0000000000000000| 
|2016|  b|200.2220000000000000| 13|12.3000000000000000|6.0000000000000000| 
|2014|  b|200.1110000000000000| 14|12.3000000000000000|7.0000000000000000| 
+----+------+--------------------+-----+-------------------+------------------+ 
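As an aside (my observation, not something stated in the answer): the long runs of trailing zeros come from the declared scale of 16 in DecimalType(32,16). If the data only carries a few decimal places, a narrower type keeps the output compact, for example:

StructField("Temperature", DecimalType(10, 4), true) // would print 201.2220 instead of 201.2220000000000000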
Comment: Wow, magic! Thank you Piotr R and Sakalya! –

Answer (0 votes)

Since you already know the expected schema, it is better to skip the manual parsing and use a proper input format. For Spark 1.6 / Scala 2.10, include the spark-csv package (--packages com.databricks:spark-csv_2.10:1.4.0) and:

val sqlContext: SQLContext = ???
val path: String = ???

sqlContext.read
  .format("csv")
  .schema(schema.weatherdata)
  .option("header", "true")
  .load(path)

For 2.0+:

val spark: SparkSession = ???
val path: String = ???

spark.read
  .format("csv")
  .schema(schema.weatherdata)
  .option("header", "true")
  .load(path)
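One practical note, assuming a spark-shell session as in the question: in 2.0+ the csv format is built in, while on 1.6 the shell has to be started with the package from the answer on the classpath:

spark-shell --packages com.databricks:spark-csv_2.10:1.4.0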
Comment: Thanks @zero323... I installed the Databricks csv data source package this week. Much appreciated! –
