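Hive and Impala don't really have a mechanism for working with XML files (which is odd, considering that most databases support XML). That said, if I were faced with this problem, I'd use Pig to import the data into HCatalog. At that point, it's fully usable by both Hive and Impala.
Here's a quick and dirty example of getting some XML data into HCatalog using Pig: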
--rss.pig
REGISTER piggybank.jar
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
STORE data into 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
dump validate;
--results
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
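One thing the script above takes for granted: as far as I know, HCatStorer writes into a table that already exists, so rss_items has to be created in Hive before the STORE runs. The original answer doesn't show the DDL, so the following is only a sketch under my own assumptions (all four columns as STRING to match the chararray fields, default storage format, default database):

```sql
-- Assumed DDL, not part of the original answer. Run in Hive before rss.pig.
-- Column names and order mirror the fields generated by the Pig script.
CREATE TABLE IF NOT EXISTS rss_items (
  link        STRING,
  title       STRING,
  description STRING,
  pubdate     STRING
);

-- Run in impala-shell after the Pig job finishes, so Impala picks up
-- the table that was created and populated outside of Impala.
INVALIDATE METADATA rss_items;
```

Depending on your setup, the script may also need to be launched with pig -useHCatalog rss.pig so that the HCatalog jars are on the classpath.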
--Impala query
select * from rss_items
--Impala results
link title description pubdate
0 http://www.hannonhill.com/news/item1.html News Item 1 Description of news item 1 here. 03 Jun 2003 09:39:21
1 http://www.hannonhill.com/news/item2.html News Item 2 Description of news item 2 here. 30 May 2003 11:06:42
2 http://www.hannonhill.com/news/item3.html News Item 3 Description of news item 3 here. 20 May 2003 08:56:02
--rss.txt data file
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>[email protected]</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>