2016-04-07

How to select a subset of fields from a Spark array column?

Let's say I have a DataFrame built as follows:

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))

The schema is:

root 
|-- subClasss: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- id: string (nullable = true) 
| | |-- size: integer (nullable = false) 
| | |-- useless: string (nullable = true) 

I'm looking for a way to select only a subset of the fields of the array `subClasss` (namely `id` and `size`) while keeping the nested array structure. The resulting schema would be:

root 
    |-- subClasss: array (nullable = true) 
    | |-- element: struct (containsNull = true) 
    | | |-- id: string (nullable = true) 
    | | |-- size: integer (nullable = false) 

I tried doing a

df.select("subClasss.id","subClasss.size") 

but this splits the array `subClasss` into two separate arrays:

root 
|-- id: array (nullable = true) 
| |-- element: string (containsNull = true) 
|-- size: array (nullable = true) 
| |-- element: integer (containsNull = true) 

Is there a way to keep the original structure and simply drop the `useless` field? Something that would look like:

df.select("subClasss.[id,size]") 

Thanks for your time.

Answer


You can use a UDF like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Target shape of each array element: only the fields we want to keep.
case class Record(id: String, size: Int)

// Maps every Row of the array to a Record, dropping the `useless` field.
val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))
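A minimal usage sketch, assuming `df` and `dropUseless` are defined as above and that the implicits for the `$` column syntax are in scope (`import sqlContext.implicits._`). The alias restores the original column name, since the UDF call would otherwise produce a generated one:

```scala
import sqlContext.implicits._

// Apply the UDF and keep the original column name.
val trimmed = df.select(dropUseless($"subClasss").alias("subClasss"))

// The printed schema should now show `subClasss` as an array of
// structs containing only `id` and `size`.
trimmed.printSchema()
```

Note that the UDF deserializes and re-serializes every element, which has some overhead on large arrays, but in this Spark version it is the straightforward way to reshape a nested array column.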