2016-10-21 112 views

How to parse a JSON file with Spark

I have a JSON file that needs to be parsed. The JSON format looks like this:

{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}} 

I need to get every field in the file. How can I get "major" out of the array? Do I have to use df.select("cv_parse.basic_info.location.province") to get the "province" field?

This is the result I want:

cv_id  major    degree    birthyear  state
001    English  Bachelor  1984       New York
001    English  Master    1984       New York

Answer


This may not be the best way to do it, but you can give it a shot.

// Import the built-in SQL functions (explode) and the implicits that enable the $"column" syntax
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Read the JSON file (one JSON object per line)
val jsonDf = sqlContext.read.json("sample-data/sample.json")

jsonDf.printSchema 

Your schema will be:

root 
|-- cv_id: string (nullable = true) 
|-- cv_parse: struct (nullable = true) 
| |-- basic_info: struct (nullable = true) 
| | |-- birthyear: string (nullable = true) 
| | |-- location: struct (nullable = true) 
| | | |-- state: string (nullable = true) 
| |-- educations: array (nullable = true) 
| | |-- element: struct (containsNull = true) 
| | | |-- degree: string (nullable = true) 
| | | |-- major: string (nullable = true) 
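On the dotted-path part of the question: a path such as "cv_parse.basic_info.location.state" does walk through nested structs. Note that the schema above has a state field, not province. The sketch below is only an illustration under some assumptions: it uses SparkSession and spark.read.json on a Dataset[String] (available from Spark 2.2; the answer's sqlContext style targets older versions), and it inlines the sample record instead of reading a file. The names spark, sample, df, states and majors are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.2+ running locally; in spark-shell this reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-field-sketch")
  .getOrCreate()
import spark.implicits._

// The sample record from the question, inlined so the sketch does not need a file.
val sample =
  """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""

val df = spark.read.json(Seq(sample).toDS())

// A dotted path walks through nested structs; the result column is named after the last field.
val states = df.select("cv_parse.basic_info.location.state")
states.show()

// The same kind of path through an array of structs yields an array of that field's values,
// still one row per input record. To get one row per education, explode is needed instead.
val majors = df.select($"cv_parse.educations.major")
majors.show()
```

So a plain select is enough for the struct fields, while the array needs the extra explode step described next.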

Now you can explode the educations array, which produces one row per array element:

val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
  $"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")

explodedResult.printSchema

Your schema will now be:

root 
|-- cv_id: string (nullable = true) 
|-- col: struct (nullable = true) 
| |-- degree: string (nullable = true) 
| |-- major: string (nullable = true) 
|-- birthyear: string (nullable = true) 
|-- state: string (nullable = true) 

Now you can select the columns:

explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show 

+-----+---------+--------+--------+-------+
|cv_id|birthyear|   state|  degree|  major|
+-----+---------+--------+--------+-------+
|  001|     1984|New York|Bachelor|English|
|  001|     1984|New York| Master |English|
+-----+---------+--------+--------+-------+
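
The explode and select steps above can also be chained, and the exploded column can be given a name instead of the default col. The following is a minimal sketch rather than a definitive version: it assumes Spark 2.2+ with SparkSession, inlines the sample record, and uses the made-up names spark, sample, df, education and result. It also lists the columns in the order the question asked for.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Assumption: Spark 2.2+ running locally; in spark-shell this reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-sketch")
  .getOrCreate()
import spark.implicits._

// The sample record from the question, inlined so the sketch does not need a file.
val sample =
  """{"cv_id":"001","cv_parse":{"educations":[{"major":"English","degree":"Bachelor"},{"major":"English","degree":"Master "}],"basic_info":{"birthyear":"1984","location":{"state":"New York"}}}}"""

val df = spark.read.json(Seq(sample).toDS())

val result = df
  .select(
    $"cv_id",
    // One output row per element of the educations array, named "education" instead of "col".
    explode($"cv_parse.educations").as("education"),
    $"cv_parse.basic_info.birthyear",
    $"cv_parse.basic_info.location.state")
  // Flatten the struct and put the columns in the order requested in the question.
  .select("cv_id", "education.major", "education.degree", "birthyear", "state")

result.show()
```

Naming the exploded column is optional, but it keeps the second select readable when more than one array is involved.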