2012-09-28 58 views
6

Hive: parsing JSON

I want to extract a few values from nested JSON for millions of rows (a 5 TB+ table). What is the most efficient way to do this?

Here is an example:

{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}} 

I need these values out of the JSON above:

Country    Page   impressions_s  impressions_o
---------  -----  -------------  --------------
US         227    10             10

Hive has the json_tuple function, but I am not sure whether it is the best function for this. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject

Answers

3

You can use get_json_object:

select get_json_object(fieldname, '$.country'), 
     get_json_object(fieldname, '$.data.ad.impressions.s') from ... 
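
As a minimal sketch, the same call can be applied to the sample record from the question to pull all four values. This assumes Hive 0.13 or later, so that a literal can be selected in a subquery without a table; the aliases `j` and `t` are hypothetical. Note that `s` and `o` sit under `impressions`, so the path has to include that level.

```sql
-- Sketch only: the subquery stands in for a real table with a JSON string column.
SELECT get_json_object(t.j, '$.country')                            AS country,
       CAST(get_json_object(t.j, '$.page') AS INT)                  AS page,
       CAST(get_json_object(t.j, '$.data.ad.impressions.s') AS INT) AS impressions_s,
       CAST(get_json_object(t.j, '$.data.ad.impressions.o') AS INT) AS impressions_o
FROM (
  SELECT '{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}' AS j
) t;
```

get_json_object returns strings, which is why the numeric fields are cast here. Each call is evaluated separately, so the cost per row is likely to grow with the number of fields extracted.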

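You will get better performance with json_tuple, but I found a "how to" for getting values out of JSON nested inside JSON. To format your table, you can use something like this: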

from table t lateral view explode(split(regexp_replace(get_json_object(ln, '$.data.ad.s'), '\\[|\\]', ''), ',')) tb1 as s

The code above will turn your "array" into a column.
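
To make the array case concrete, here is a small sketch on an invented record in which `$.data.ad.s` holds a JSON array. The sample JSON, the `id` column, and the subquery are assumptions for illustration only and do not come from the question's data. It again assumes Hive 0.13 or later for the table-less subquery.

```sql
-- Sketch only: the literal row is hypothetical; 'ln' mirrors the column name used above.
SELECT t.id, tb1.s
FROM (
  SELECT 1 AS id, '{"data":{"ad":{"s":[10,20,30]}}}' AS ln
) t
LATERAL VIEW explode(
  split(
    -- strip the surrounding [ and ] from the JSON array text, then split on commas
    regexp_replace(get_json_object(t.ln, '$.data.ad.s'), '\\[|\\]', ''),
    ','
  )
) tb1 AS s;
```

This should produce one output row per array element. The approach relies on plain string splitting, so it is only suitable for arrays of simple scalar values that contain no commas themselves.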

For more information: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

I hope this helps...

6

Here is something you can try quickly, although I would suggest using a JSON SerDe.

nano /tmp/hive-parsing-json.json

{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}} 

Create the base table:

hive > CREATE TABLE hive_parsing_json_table (json string); 

Load the JSON file into the table:

hive > LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table; 

Query the table:

hive > select v1.Country, v1.Page, v4.impressions_s, v4.impressions_o 
from hive_parsing_json_table hpjp 
    LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1 
    as Country, Page, data 
    LATERAL VIEW json_tuple(v1.data, 'ad') v2 
    as Ad 
    LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3 
    as Impressions 
    LATERAL VIEW json_tuple(v3.Impressions, 's' , 'o') v4 
    as impressions_s,impressions_o; 

Output:

v1.country v1.page  v4.impressions_s v4.impressions_o 
US  227  10   10 
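
A possible variation on the query above, offered as a sketch rather than a tested optimization: use get_json_object once to jump straight to the `impressions` object, so that only two json_tuple lateral views are needed instead of four. It assumes the same hive_parsing_json_table; the alias `v2` is reused here with a different meaning than in the original query.

```sql
-- Sketch only: same table and column as above, fewer lateral views.
SELECT v1.country, v1.page, v2.impressions_s, v2.impressions_o
FROM hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page') v1
    AS country, page
LATERAL VIEW json_tuple(
        get_json_object(hpjp.json, '$.data.ad.impressions'), 's', 'o') v2
    AS impressions_s, impressions_o;
```

Whether this is faster than the four-step version depends on the data and the Hive version, so it is worth comparing both on a sample before running against the full table.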
0

You can do this with Hive's native JSON SerDe ('org.apache.hive.hcatalog.data.JsonSerDe'). Here are the steps:

ADD JAR /path/to/hive-hcatalog-core.jar;

Create a table as below:

CREATE TABLE json_serde_nestedjson (
    country string, 
    page int, 
    data struct < ad: struct < impressions: struct < s:int, o:int > > > 
) 
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; 

Then load the data (stored in a file):

LOAD DATA LOCAL INPATH '/tmp/nested.json' INTO TABLE json_serde_nestedjson; 

Then query for the required data:

SELECT country, page, data.ad.impressions.s, data.ad.impressions.o 
FROM json_serde_nestedjson;
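
Since the question concerns a 5 TB+ table, one possible follow-up is to materialize the extracted fields once, so that later queries do not have to re-parse the JSON. The sketch below assumes the json_serde_nestedjson table defined above; the target table name `impressions_flat` and the choice of ORC as the storage format are assumptions, not part of the original answer.

```sql
-- Sketch only: 'impressions_flat' is a hypothetical table name.
CREATE TABLE impressions_flat
STORED AS ORC
AS
SELECT country,
       page,
       data.ad.impressions.s AS impressions_s,
       data.ad.impressions.o AS impressions_o
FROM json_serde_nestedjson;

-- Later queries read the flat columnar table instead of the raw JSON.
SELECT country, SUM(impressions_s) AS total_s
FROM impressions_flat
GROUP BY country;
```

This trades extra storage and a one-time conversion job for cheaper repeated reads; if the JSON is only ever scanned once, the conversion may not pay off.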