Hive XML SerDe: using "attribute to attribute" mapping
I have an XML document that looks like the following:
<root>
<unwanted>
...
</unwanted>
<wanted version="A">
<unwanted2 type='1'>
...
</unwanted2>
<unwanted2 type='2'>
...
</unwanted2>
<unwanted2 type='3'>
...
</unwanted2>
<wanted2>
<detail>
<row date="Jan-17" price="100" inventory="50"/>
<row date="Feb-17" price="101" inventory="40"/>
<row date="Mar-17" price="102" inventory="30"/>
</detail>
</wanted2>
</wanted>
<wanted version="B">
<unwanted2 type='1'>
...
</unwanted2>
<unwanted2 type='2'>
...
</unwanted2>
<unwanted2 type='3'>
...
</unwanted2>
<wanted2>
<detail>
<row date="Jan-17" price="200" inventory="60"/>
<row date="Feb-17" price="201" inventory="70"/>
<row date="Mar-17" price="202" inventory="80"/>
</detail>
</wanted2>
</wanted>
</root>
I want to import the file into a Hive table, preferably in this format:
Version | Date | Price | Inventory
A Jan-17 100 50
A Feb-17 101 40
A Mar-17 102 30
B Jan-17 200 60
B Feb-17 201 70
B Mar-17 202 80
For now, though, I would settle for importing it as a map of date and price:
version | spot_date
A {Date: Jan-17, Price: 100, Inventory: 50}
A {Date: Feb-17, ...}
A {Date: Mar-17, ...}
B {Date: Jan-17, ...}
B {Date: Feb-17, ...}
B {Date: Mar-17, ...}
I am trying to use the XMLSerDe for Hive, with its "attribute to attribute" mapping feature.
My query looks like this:
CREATE EXTERNAL TABLE ppa_test(
version STRING,
spot_date MAP<STRING,STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.version"="/wanted/@version",
"column.xpath.spot_date"="/wanted/wanted2/detail/row",
"xml.map.specification.row"="date->@date"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<wanted ",
"xmlinput.end"="</wanted>"
);
However, when I load the data, I get:
version | spot_date
A {"row":"Mar-17"}
B {"row":"Mar-17"}
If I instead change the xml.map.specification to:
"xml.map.specification.row"="@date->@price"
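I can read each XML row individually, but all of them end up in the same Hive table row, and I would also prefer to use the attribute names as keys: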
Version | spot_date
A {"Mar-17":"102", "Feb-17":"101", "Jan-17":"100"}
B {"Mar-17":"202", "Feb-17":"201", "Jan-17":"200"}
1. How can I record each XML row node in its own Hive record?
2. How can I use the attribute names (or custom strings) as the keys?
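For the first question, one direction that may be worth trying with the map-based table is Hive's built-in LATERAL VIEW with explode(), which turns each map entry into its own output row. The following is only a sketch: it assumes the ppa_test definition above with "xml.map.specification.row"="@date->@price", and the aliases m, spot_key, spot_value, and row_date are names made up for illustration. Note that the inventory attribute is not carried by this map, so it would not appear in the result.

```sql
-- Sketch only: assumes ppa_test(version STRING, spot_date MAP<STRING,STRING>)
-- populated with "@date->@price", i.e. key = date attribute, value = price attribute.
SELECT t.version,
       m.spot_key   AS row_date,  -- map key (the date attribute)
       m.spot_value AS price      -- map value (the price attribute)
FROM ppa_test t
LATERAL VIEW explode(t.spot_date) m AS spot_key, spot_value;
```

This leaves the table itself unchanged (one Hive record per wanted element) and only flattens it at query time, so it could also be wrapped in a view.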
Edit
So changing spot_date MAP<STRING,STRING> to:
CREATE EXTERNAL TABLE ppa_test(
scenario STRING,
spot_date array<struct<
date: string,
price: string,
inventory: string
>>
)...
gives me:
Version | spot_date
A [{date: Jan-17, price: 100, inventory: 50},
{date: Feb-17, price: 101, inventory: 40},
{date: Mar-17, price: 102, inventory: 30}]
B [{date: Jan-17, ... ]
The array of objects accomplishes #2 from above, but I am still unsure about #1.
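A possible way to get from the array of structs to the flat Version | Date | Price | Inventory layout is to explode the array at query time. This is a minimal sketch, not a verified solution: it assumes the edited table definition above (first column named scenario, spot_date declared as an array of structs with date, price, and inventory fields), and the aliases d, r, and spot_day are made up for illustration.

```sql
-- Sketch only: assumes ppa_test(scenario STRING,
--   spot_date array<struct<date:string, price:string, inventory:string>>).
SELECT t.scenario AS version,
       r.`date`   AS spot_day,   -- backticks because DATE is also a Hive keyword
       r.price,
       r.inventory
FROM ppa_test t
LATERAL VIEW explode(t.spot_date) d AS r;  -- one output row per struct in the array
```

Each struct in spot_date becomes its own output row, so the XML row nodes are separated at read time while the SerDe still produces one Hive record per wanted element.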