2014-02-28 61 views

HiveQL and XPath - how to extract values and replace certain characters

I have an XML blob (shown below) stored in a Hive log table.

<user> 
    <uid>1424324325</uid> 
    <attribs> 
     <field> 
     ... 
     </field> 
     <field> 
      <name>first</name> 
      <value>Joh,n</value> 
     </field> 
     <field> 
     ... 
     </field> 
     <field> 
      <name>last</name> 
      <value>D,oe</value> 
     </field> 
     <field> 
     ... 
     </field> 
    </attribs> 
</user> 

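Each row of the Hive table holds information about a different user, and I want to extract the uid, first name, and last name (with any commas removed from the names).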

1424324325 John Doe 
1424435463 Jane Smith 

I am able to extract the values from the XML.

SELECT uid, fn, ln 
FROM log_table 
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid 
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn 
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln; 

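However, I am stumped trying to remove the unwanted commas (if present) from the first and last names.

When I try to extract the first name using either of the approaches shown below, the result is empty.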

LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/replace(text(),",","")')) fns as fn 

LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/translate(text(),",","")')) fns as fn 

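When I try it as shown below, replace fails with an invalid function error, while translate returns the data without removing the extra commas.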

LATERAL VIEW explode(xpath(logs['users_updates'], replace('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn 

LATERAL VIEW explode(xpath(logs['users_updates'], translate('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn 

How can I extract the name values without the commas?

1424324325 John Doe 
1424435463 Jane Smith 

Final solution: here is the final working query, following the suggestion in the answer below.

SELECT uid, regexp_replace(fn, ',', '') AS fname, regexp_replace(ln, ',', '') AS lname 
FROM log_table 
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid 
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn 
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln; 

Answer

Hive has no support for XPath 2.0. This affects your problem in two ways:

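  • Function calls as axis steps are not allowed. While //value/translate(text(), ',', '') (which calls translate once for each <value/> element) is valid XPath 2.0, you cannot do this in XPath 1.0. translate(//value, ',', ''), on the other hand, is valid XPath 1.0 but returns one single string: the node set is converted to a string first (XPath 1.0 uses the string value of its first node), so you do not get one result per <value/> element.
  • There is no replace function in XPath 1.0.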
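
To illustrate the first point, XPath 1.0 does accept translate when it is the outermost call rather than a step inside the path. The sketch below assumes Hive's built-in xpath_string UDF, which evaluates the expression as a string rather than as a node set, and it uses a literal XML fragment as a stand-in for the real log column. It also assumes there is exactly one matching field.

```sql
-- Sketch: translate() wraps the whole path instead of appearing as a path step.
-- The XML literal is a stand-in for logs['users_updates'].
SELECT xpath_string(
  '<user><attribs><field><name>first_name</name><value>Joh,n</value></field></attribs></user>',
  'translate(/user/attribs/field[name = "first_name"]/value, ",", "")'
) AS fname;
```

The third argument of translate is empty, so every comma in the selected value is dropped instead of being mapped to another character.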

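It is probably easier to pass the values through with their commas and do the string manipulation in Hive.

Additional note, since you do not have XPath 2.0 yet: there, translate accepts only a single string as its first argument, so you would need string-join first.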
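
One way to sketch this suggestion without the LATERAL VIEW clauses is shown below. It assumes Hive's built-in xpath_string UDF and that each XML blob holds exactly one uid and one matching field per name. The table and column names are taken from the question.

```sql
-- Extract each value as a plain string with XPath 1.0,
-- then strip the commas in HiveQL rather than inside the XPath expression.
SELECT
  xpath_string(logs['users_updates'], '/user/uid') AS uid,
  regexp_replace(
    xpath_string(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value'),
    ',', '') AS fname,
  regexp_replace(
    xpath_string(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value'),
    ',', '') AS lname
FROM log_table;
```

A design difference worth checking against your data: xpath_string should yield an empty string when nothing matches, whereas a LATERAL VIEW explode over an empty xpath result drops the row.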

Thank you very much for the information. I was not aware of these limitations. As you suggested, I was able to achieve it by using regexp_replace in Hive. – rev