How to select a subset of fields from an array column in Spark?
Let's say I have a DataFrame as follows:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])
val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema is:
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
| | |-- useless: string (nullable = true)
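I am looking for a way to select only a subset of the fields, id and size, from the array column subClasss, while keeping the nested array structure. The resulting schema would be: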
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
I tried:
df.select("subClasss.id","subClasss.size")
but this splits the array subClasss into two separate arrays:
root
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
|-- size: array (nullable = true)
| |-- element: integer (containsNull = true)
Is there a way to keep the original structure and just drop the useless field? Something that looks like:
df.select("subClasss.[id,size]")
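To make the desired result concrete, here is a minimal sketch of one possible direction. It assumes Spark 2.4 or later, where the SQL higher-order function `transform` is available, and it uses a `SparkSession` rather than the `sqlContext` above. The object name `SelectSubfields` and the helper `keepIdAndSize` are hypothetical names chosen for illustration, and I have not checked this against older Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

object SelectSubfields {
  // Hypothetical helper: rebuilds every element of the array as a struct
  // that carries only `id` and `size`, so the column stays array<struct<...>>.
  // Assumes Spark 2.4+ (SQL higher-order function `transform`).
  def keepIdAndSize(df: DataFrame): DataFrame =
    df.select(
      expr("transform(subClasss, x -> struct(x.id AS id, x.size AS size))")
        .as("subClasss")
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-subfields")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      MotherClass(Array(
        SubClass("1", 1, "thisIsUseless"),
        SubClass("2", 2, "thisIsUseless")
      ))
    ).toDF()

    // The element struct should now list only id and size.
    keepIdAndSize(df).printSchema()

    spark.stop()
  }
}
```

The idea is that the array is never split: each element is mapped to a smaller struct in place, so the useless field disappears while the nesting is preserved.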
Thank you for your time.