2017-06-02

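Spark Scala: converting an array-of-struct column to a string column

I have a column of type array<struct>, inferred from a JSON file. I want to convert the array<struct> to a string so that I can keep this array column as-is in Hive and export it to an RDBMS as a single column.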

temp.json

{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id": 
{"value":"296160"},"sku_id": 
{"value":"312002"}}],"user_id":"6666","zip_code":"666"}} 

Processing:

scala> val temp = spark.read.json("s3://check/1/temp1.json") 
temp: org.apache.spark.sql.DataFrame = [properties: struct<items: 
array<struct<invoicid:struct<value:string>,job_id:struct<value:string>,sku_id:struct<value:string>>>, user_id: string ... 1 more field>] 

scala> temp.printSchema
root
 |-- properties: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- invoicid: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- job_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- sku_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |-- user_id: string (nullable = true)
 |    |-- zip_code: string (nullable = true)


scala> temp.select("properties").show 
+--------------------+ 
|          properties|
+--------------------+ 
|[WrappedArray([[9...| 
+--------------------+ 


scala> temp.select("properties.items").show 
+--------------------+ 
|               items|
+--------------------+ 
|[[[923659],[29616...| 
+--------------------+ 


scala> temp.createOrReplaceTempView("tempTable") 

scala> spark.sql("select properties.items from tempTable").show 
+--------------------+ 
|               items|
+--------------------+ 
|[[[923659],[29616...| 
+--------------------+ 

How can I get a result like this:

+-----------------------------------------------------------------------------------------+
|items                                                                                    |
+-----------------------------------------------------------------------------------------+
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]|
+-----------------------------------------------------------------------------------------+

That is, I want the array element values as a string, without any change.

Answer

to_json is the function you are looking for:

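// get_json_object is used below as well, so import both functions.
// In spark-shell, spark.implicits._ (needed for $"...") is already imported.
import org.apache.spark.sql.functions.{get_json_object, to_json}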

val df = spark.read.json(sc.parallelize(Seq(""" 
    {"properties":{"items":[{"invoicid":{"value":"923659"},"job_id": 
    {"value":"296160"},"sku_id": 
    {"value":"312002"}}],"user_id":"6666","zip_code":"666"}}"""))) 


df 
    .select(get_json_object(to_json($"properties"), "$.items").alias("items")) 
    .show(false) 
+-----------------------------------------------------------------------------------------+ 
|items                                                                                    |
+-----------------------------------------------------------------------------------------+ 
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]| 
+-----------------------------------------------------------------------------------------+ 
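
The answer above serializes the whole properties struct and then extracts $.items from the resulting JSON. As a hedged alternative sketch: on Spark 2.2 or later (an assumption; earlier versions document to_json for struct columns only), to_json can be applied to the array<struct> column directly. The local SparkSession and the app name below are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_json

// Assumption: Spark 2.2+, where to_json accepts an array of structs
// and spark.read.json accepts a Dataset[String].
val spark = SparkSession.builder()
  .master("local[*]")          // illustrative local session
  .appName("items-to-json")    // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Same sample record as in the question, on a single line.
val json =
  """{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}"""

val df = spark.read.json(Seq(json).toDS())

// Serialize the nested array<struct> column straight to a JSON string column.
val itemsAsString = df.select(to_json($"properties.items").alias("items"))

itemsAsString.printSchema()   // items should now be reported as a string column
itemsAsString.show(false)
```

Because items is then an ordinary string column, it can be written out with the usual DataFrame writers (for example to a Hive table or over JDBC) like any other string column.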

This is exactly what I was looking for. Thanks, this is a much-needed function.
