2017-04-12 64 views

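Extract fields from JSON using Hive

I have a problem with my Hive code. I want to extract fields from JSON data using Hive. The following is a sample of the JSON format: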

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"versionModified"{"machine":"123.dfer","founder":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}} 

I want to get the following fields:

  • version
  • vehicle
  • ts
  • founder
  • state

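The problem is that founder and state are inside the array "Version". Can anyone help me get them out? Also, sometimes a different key may appear in place of versionModified.

For example, sometimes my data looks like this: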

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"anotherCriteria":{"engine":"123.dfer","developer":"3.0","state":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}} 

Adding some more sample data below:

{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC"{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}} 


{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP"{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}} 


{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX"{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}} 

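I need to load this data into different tables based on the key under "Version": if it is "BOX", put it in one table; if it is "GAP", put it in another table, and so on.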

Refer to this for using get_json_object in Hive: http://stackoverflow.com/questions/24447428/parse-json-arrays-using-hive
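
To make that suggestion concrete, here is a minimal sketch of the get_json_object route. It assumes the JSON is valid (the samples above are missing a colon after the key inside "Version") and compact, with one document per line. The table raw_json and its column json are hypothetical names, not from the question. The regular expression that picks out the varying key is a simple assumption that breaks if there is whitespace between the tokens.

```sql
-- Hypothetical staging table: one JSON document per line, kept as a plain string.
CREATE TABLE raw_json (json STRING);

SELECT
  get_json_object(json, '$.Rtype.ver')     AS ver,
  get_json_object(json, '$.Rtype.vehicle') AS vehicle,
  version_key,
  -- Build the path from the key name found in each row.
  get_json_object(json, concat('$.Rtype.MOD.Version[0].', version_key, '.ts'))      AS ts,
  get_json_object(json, concat('$.Rtype.MOD.Version[0].', version_key, '.founder')) AS founder,
  get_json_object(json, concat('$.Rtype.MOD.Version[0].', version_key, '.state'))   AS state
FROM (
  SELECT
    json,
    -- Capture the first key inside the Version array, e.g. versionModified, BOX, GAP.
    regexp_extract(json, '"Version":\\[\\{"([^"]+)"', 1) AS version_key
  FROM raw_json
) t;
```

Fields that a given record does not contain (for example founder in the "GAP" sample) simply come back as NULL.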

Please show your table schema.

Don't mix questions. Open a separate one for the INSERT issue.

Answer

You can use a JSON SerDe to get all the fields.

Just follow the steps below.

1. Download the JSON SerDe from http://www.congiu.net/hive-json-serde/1.3/

2. Add the JSON SerDe jar

hive> ADD jar /root/json-serde-1.3-jar-with-dependencies.jar; 
Added [/root/json-serde-1.3-jar-with-dependencies.jar] to class path 
Added resources: [/root/json-serde-1.3-jar-with-dependencies.jar] 

3. Create the table

CREATE TABLE json_serde_table (
    Rtype struct<ver:int, os:string, type:string, vehicle:string,
        MOD:struct<Version:array<struct<versionModified:struct<
            machine:string, founder:string, state:string,
            fashion:string, cdc:string, dof:string, ts:string>>>>>
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'; 

4. Load the JSON file into the table using the query below

hive> load data local inpath '/root/json.txt' INTO TABLE json_serde_table; 
Loading data to table default.json_serde_table 
Table default.json_serde_table stats: [numFiles=1, totalSize=234] 
OK 
Time taken: 0.877 seconds 

5. Fire the query to get the result

hive> select Rtype.ver ver ,Rtype.type type ,Rtype.vehicle vehicle ,Rtype.MOD.version[0].versionModified.ts ts,Rtype.MOD.version[0].versionModified.founder founder,Rtype.MOD.version[0].versionModified.state state from json_serde_table; 
Query ID = root_20170412170606_a674d31b-31d7-477b-b9ff-3ebd76636cf8 
Total jobs = 1 
Launching Job 1 out of 1 
Number of reduce tasks is set to 0 since there's no reduce operator 
Starting Job = job_1491484583384_0018, Tracking URL = http://mac127:8088/proxy/application_1491484583384_0018/ 
Kill Command = /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/bin/hadoop job -kill job_1491484583384_0018 
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 
2017-04-12 17:06:44,990 Stage-1 map = 0%, reduce = 0% 
2017-04-12 17:06:53,361 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec 
MapReduce Total cumulative CPU time: 1 seconds 800 msec 
Ended Job = job_1491484583384_0018 
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1 Cumulative CPU: 1.8 sec HDFS Read: 4891 HDFS Write: 50 SUCCESS 
Total MapReduce CPU Time Spent: 1 seconds 800 msec 
OK 
1  ns  Mh-3412 2000-04-01T00:00:00.171Z  3.0  Florida 
Time taken: 19.745 seconds, Fetched: 1 row(s) 
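
The schema above hard-codes versionModified, so it does not cover records where that key is "BOX", "GAP" or something else. One possible way around that, sketched here under assumptions, is to declare each element of Version as a map so the varying key becomes a map key, and then route rows with a multi-table insert. The names json_serde_map_table, box_table and gap_table and their column lists are hypothetical. The sketch also assumes valid JSON and that every inner value is a string, as in the samples.

```sql
-- Variant of the table: each Version element is a map from the varying key
-- (versionModified, BOX, GAP, ...) to a map of its attributes.
-- ver is declared as string because the JSON stores it quoted ("1").
CREATE TABLE json_serde_map_table (
    Rtype struct<ver:string, os:string, type:string, vehicle:string,
        MOD:struct<Version:array<map<string,map<string,string>>>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

-- Hypothetical target tables, one per key.
CREATE TABLE box_table (ver string, vehicle string, ts string, fah string, fax string);
CREATE TABLE gap_table (ver string, vehicle string, ts string, fah string, ght string);

-- Read the source once and send each row to the table that matches its key.
FROM (
  SELECT
    Rtype.ver        AS ver,
    Rtype.vehicle    AS vehicle,
    map_keys(v)[0]   AS version_key,  -- the single key of this Version element
    map_values(v)[0] AS attrs         -- its attributes as map<string,string>
  FROM json_serde_map_table
  LATERAL VIEW explode(Rtype.MOD.Version) t AS v
) src
INSERT INTO TABLE box_table
  SELECT ver, vehicle, attrs['ts'], attrs['FAH'], attrs['FAX']
  WHERE version_key = 'BOX'
INSERT INTO TABLE gap_table
  SELECT ver, vehicle, attrs['ts'], attrs['FAH'], attrs['GHT']
  WHERE version_key = 'GAP';
```

Looking up an attribute that a record does not have returns NULL, so the same pattern extends to further keys by adding one more INSERT clause per target table.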

The colon (:) after the "versionModified" field is missing in your JSON data.

Why download a JSON SerDe when one is already part of the distribution?

You are right, it can be done with org.apache.hadoop.hive.contrib.serde2.JsonSerde. I just tried it and it gave me the same result. I will edit my answer. Thanks.