Using "attribute to attribute" mapping with the Hive XML SerDe

I have an XML document that looks like the following:

<root> 
<unwanted> 
    ... 
</unwanted> 
<wanted version="A"> 
    <unwanted2 type='1'> 
    ... 
    </unwanted2> 
    <unwanted2 type='2'> 
    ... 
    </unwanted2> 
    <unwanted2 type='3'> 
    ... 
    </unwanted2> 
    <wanted2> 
    <detail> 
    <row date="Jan-17" price="100" inventory="50"/> 
    <row date="Feb-17" price="101" inventory="40"/> 
    <row date="Mar-17" price="102" inventory="30"/> 
    </detail> 
    </wanted2> 
</wanted> 
<wanted version="B"> 
    <unwanted2 type='1'> 
    ... 
    </unwanted2> 
    <unwanted2 type='2'> 
    ... 
    </unwanted2> 
    <unwanted2 type='3'> 
    ... 
    </unwanted2> 
    <wanted2> 
    <detail> 
    <row date="Jan-17" price="200" inventory="60"/> 
    <row date="Feb-17" price="201" inventory="70"/> 
    <row date="Mar-17" price="202" inventory="80"/> 
    </detail> 
    </wanted2> 
</wanted> 
</root>

I would like to import the file into a Hive table, ideally in this format:

Version | Date | Price | Inventory 
A   Jan-17 100  50 
A   Feb-17 101  40 
A   Mar-17 102  30 
B   Jan-17 200  60 
B   Feb-17 201  70 
B   Mar-17 202  80

For now, though, I would settle for importing each row as a map of its date, price, and inventory:

version | spot_date 
A   {Date: Jan-17, Price: 100, Inventory: 50} 
A   {Date: Feb-17, ...} 
A   {Date: Mar-17, ...} 
B   {Date: Jan-17, ...} 
B   {Date: Feb-17, ...} 
B   {Date: Mar-17, ...}

I am trying to use XMLSerDe for Hive, with its "attribute to attribute" feature.

My query looks like this:

CREATE EXTERNAL TABLE ppa_test(
    version  STRING, 
    spot_date  MAP<STRING,STRING> 
) 
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
WITH SERDEPROPERTIES (
    "column.xpath.version"="/wanted/@version", 
    "column.xpath.spot_date"="/wanted/wanted2/detail/row", 
    "xml.map.specification.row"="date->@date" 
) 
STORED AS 
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
TBLPROPERTIES (
"xmlinput.start"="<wanted ", 
"xmlinput.end"="</wanted>" 
);

However, when I load the data, I get:

version | spot_date 
A   {"row":"Mar-17"} 
B   {"row":"Mar-17"}

If I instead change the xml.map.specification entry to:

"xml.map.specification.row"="@date->@price"

I can read each XML row separately, but they all end up in the same Hive table row, and I would also prefer to use the attribute names as keys:

Version | spot_date 
A   {"Mar-17":"102", "Feb-17":"101", "Jan-17":"100"} 
B   {"Mar-17":"202", "Feb-17":"201", "Jan-17":"200"}

1. How can I record each XML row node in its own Hive record?
2. How can I use the attribute names (or custom strings) as the keys?

Edit

So changing spot_date from MAP<STRING,STRING> to...

CREATE EXTERNAL TABLE ppa_test(
    scenario STRING, 
    spot_date array<struct< 
     date:  string, 
     price:  string, 
     inventory: string 
    >> 
)...

gives me:

Version | spot_date 
A   [{date: Jan-17, price: 100, inventory: 50}, 
      {date: Feb-17, price: 101, inventory: 40}, 
      {date: Mar-17, price: 102, inventory: 30}] 
B   [{date: Jan-17, ... ]

The array above takes care of #2, but I am still unsure about #1.

Source

2017-05-24 getglad

You can explode the array of structs created for #2 to get #1.

CREATE EXTERNAL TABLE ppa_test(
    scenario STRING, 
    spot_date ARRAY<STRUCT<spotdates: struct< 
     date:  string, 
     price:  string, 
     inventory: string 
    >>> 
)

You can use a lateral view for this:

DROP TABLE IF EXISTS ppa_test_exploded; 
CREATE TABLE ppa_test_exploded as 
SELECT scenario, 
    SD.spotdates.date as date, 
    SD.spotdates.price as price, 
    SD.spotdates.inventory as inventory 
    FROM ppa_test 
    LATERAL VIEW EXPLODE(spot_date) exploded as SD;
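
If the table keeps the flat layout from the question's edit (spot_date as an array of structs with date, price and inventory fields, without the spotdates wrapper), the same lateral view should work with direct field access. This is only a sketch under that assumption: the output alias spot_dt is a made-up name, and the backticks around date are a precaution, since some Hive versions may treat date as a reserved keyword.

```sql
-- Assumes: spot_date ARRAY<STRUCT<date:string, price:string, inventory:string>>
-- (the flat layout from the question's edit, not the nested spotdates one).
SELECT scenario,
       sd.`date`    AS spot_dt,    -- hypothetical alias; backticks guard against a keyword clash
       sd.price     AS price,
       sd.inventory AS inventory
FROM ppa_test
LATERAL VIEW explode(spot_date) exploded AS sd;  -- one output row per array element
```

explode also accepts a MAP column and then yields one row per entry with a key column and a value column, so the "@date->@price" version of the table from the question could probably be flattened the same way by giving the lateral view two column aliases.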

Hope this helps.

Source

2017-09-21 07:52:05 kndarp
