2017-05-24

Using "attribute to attribute" mapping with the Hive XML SerDe

I have an XML document that looks like the following:

<root> 
<unwanted> 
    ... 
</unwanted> 
<wanted version="A"> 
    <unwanted2 type='1'> 
    ... 
    </unwanted2> 
    <unwanted2 type='2'> 
    ... 
    </unwanted2> 
    <unwanted2 type='3'> 
    ... 
    </unwanted2> 
    <wanted2> 
    <detail> 
    <row date="Jan-17" price="100" inventory="50"/> 
    <row date="Feb-17" price="101" inventory="40"/> 
    <row date="Mar-17" price="102" inventory="30"/> 
    </detail> 
    </wanted2> 
</wanted> 
<wanted version="B"> 
    <unwanted2 type='1'> 
    ... 
    </unwanted2> 
    <unwanted2 type='2'> 
    ... 
    </unwanted2> 
    <unwanted2 type='3'> 
    ... 
    </unwanted2> 
    <wanted2> 
    <detail> 
    <row date="Jan-17" price="200" inventory="60"/> 
    <row date="Feb-17" price="201" inventory="70"/> 
    <row date="Mar-17" price="202" inventory="80"/> 
    </detail> 
    </wanted2> 
</wanted> 
</root> 

I would like to import the file into a Hive table, ideally in this format:

Version | Date | Price | Inventory 
A   Jan-17 100  50 
A   Feb-17 101  40 
A   Mar-17 102  30 
B   Jan-17 200  60 
B   Feb-17 201  70 
B   Mar-17 202  80 

For now, though, I would settle for importing each row as a map of date and price:

version | spot_date 
A   {Date: Jan-17, Price: 100, Inventory: 50} 
A   {Date: Feb-17, ...} 
A   {Date: Mar-17, ...} 
B   {Date: Jan-17, ...} 
B   {Date: Feb-17, ...} 
B   {Date: Mar-17, ...} 

I am trying to use the XmlSerDe for Hive with its "attribute to attribute" mapping feature.

My query looks like the following:

CREATE EXTERNAL TABLE ppa_test(
    version  STRING, 
    spot_date  MAP<STRING,STRING> 
) 
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' 
WITH SERDEPROPERTIES (
    "column.xpath.version"="/wanted/@version", 
    "column.xpath.spot_date"="/wanted/wanted2/detail/row", 
    "xml.map.specification.row"="date->@date" 
) 
STORED AS 
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' 
TBLPROPERTIES (
"xmlinput.start"="<wanted ", 
"xmlinput.end"="</wanted>" 
); 

However, when I load the data, I get:

version | spot_date 
A   {"row":"Mar-17"} 
B   {"row":"Mar-17"} 

If I instead change the xml.map.specification entry to:

"xml.map.specification.row"="@date->@price" 

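then each XML row is read individually, but they all end up in the same Hive table row, and I would also prefer to use the attribute names as keys: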

Version | spot_date 
A   {"Mar-17":"102", "Feb-17":"101", "Jan-17":"100"} 
B   {"Mar-17":"202", "Feb-17":"201", "Jan-17":"200"} 
  1. How can I record each XML row node as its own Hive record?
  2. How can I use the attribute names (or custom strings) as the keys?

Edit

So changing spot_date from MAP<STRING,STRING> to...

CREATE EXTERNAL TABLE ppa_test(
    scenario STRING, 
    spot_date array<struct< 
     date:  string, 
     price:  string, 
     inventory: string 
    >> 
)... 

gives me an array of objects:

Version | spot_date 
A   [{date: Jan-17, price: 100, inventory: 50}, 
      {date: Feb-17, price: 101, inventory: 40}, 
      {date: Mar-17, price: 102, inventory: 30}] 
B   [{date: Jan-17, ... ] 

The array above accomplishes #2, but I am still unsure about #1.

Answer

You can explode the array of structs created for #2 to get #1.

CREATE EXTERNAL TABLE ppa_test(
    scenario STRING, 
    spot_date ARRAY<STRUCT<spotdates: struct< 
     date:  string, 
     price:  string, 
     inventory: string 
    >>> 
) 

You can then use a lateral view for this:

DROP TABLE IF EXISTS ppa_test_exploded; 
CREATE TABLE ppa_test_exploded as 
SELECT scenario, 
    SD.spotdates.date as date, 
    SD.spotdates.price as price, 
    SD.spotdates.inventory as inventory 
    FROM ppa_test 
    LATERAL VIEW EXPLODE(spot_date) exploded as SD; 
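
If the table is instead declared as in the question's edit, with spot_date as a plain array<struct<date, price, inventory>> and no intermediate spotdates struct, the same approach can be sketched as below. This is a minimal sketch, not a tested query against the XmlSerDe table: the aliases t, x and r and the output column names version and spot_date are illustrative choices, and the backticks around date are a precaution because DATE is a reserved word in newer Hive versions.

```sql
-- Assumes ppa_test is declared as in the question's edit:
--   scenario  STRING,
--   spot_date ARRAY<STRUCT<date:string, price:string, inventory:string>>
SELECT t.scenario   AS version,
       r.`date`     AS spot_date,   -- backticks guard against the reserved word DATE
       r.price      AS price,
       r.inventory  AS inventory
FROM ppa_test t
-- explode() emits one row per array element; r is the struct for that element
LATERAL VIEW explode(t.spot_date) x AS r;
```

Each element of the array becomes its own output row, with the scenario value repeated alongside it, so the sample document should yield one record per row element.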

Hope this helps.