Can Impala query XML files stored in Hadoop/HDFS?

I am investigating whether a Hadoop/Impala combination can meet my requirements for archiving, batch processing, and real-time ad hoc queries.

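We will persist XML files (well-formed and conforming to our own XSD schema) in Hadoop and use MapReduce for end-of-day batch queries and similar jobs. For ad hoc user queries and application queries that need low latency and relatively high performance, we are considering Impala.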

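What I can't figure out is how Impala would understand the structure of the XML files well enough to query them effectively. Can Impala be used to query across XML documents in a meaningful way?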

Thanks in advance.

Source

2014-03-24 ng5000

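Hive and Impala don't really have a mechanism for working with XML files (which is odd, considering that most databases support XML). That said, if I ran into this problem, I would use Pig to import the data into HCatalog. At that point, both Hive and Impala can work with it as an ordinary table.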

Here is a quick and dirty example of getting some XML data into HCatalog using Pig:

--rss.pig

REGISTER piggybank.jar 

items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray); 

data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray, 
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray, 
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray, 
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray; 

STORE data into 'rss_items' USING org.apache.hcatalog.pig.HCatStorer(); 


validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader(); 
dump validate;

--results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21) 
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42) 
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)

--Impala query

select * from rss_items

--Impala results

link title description pubdate 
0 http://www.hannonhill.com/news/item1.html News Item 1 Description of news item 1 here. 03 Jun 2003 09:39:21 
1 http://www.hannonhill.com/news/item2.html News Item 2 Description of news item 2 here. 30 May 2003 11:06:42 
2 http://www.hannonhill.com/news/item3.html News Item 3 Description of news item 3 here. 20 May 2003 08:56:02

--rss.txt data file

<rss version="2.0"> 
    <channel> 
     <title>News</title> 
     <link>http://www.hannonhill.com</link> 
     <description>Hannon Hill News</description> 
     <language>en-us</language> 
     <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate> 
     <generator>Cascade Server</generator> 
     <webMaster>[email protected]</webMaster> 
     <item> 
     <title>News Item 1</title> 
     <link>http://www.hannonhill.com/news/item1.html</link> 
     <description>Description of news item 1 here.</description> 
     <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate> 
     <guid>http://www.hannonhill.com/news/item1.html</guid> 
     </item> 
     <item> 
     <title>News Item 2</title> 
     <link>http://www.hannonhill.com/news/item2.html</link> 
     <description>Description of news item 2 here.</description> 
     <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate> 
     <guid>http://www.hannonhill.com/news/item2.html</guid> 
     </item> 
     <item> 
     <title>News Item 3</title> 
     <link>http://www.hannonhill.com/news/item3.html</link> 
     <description>Description of news item 3 here.</description> 
     <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate> 
     <guid>http://www.hannonhill.com/news/item3.html</guid> 
     </item> 
    </channel> 
</rss>
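
To sanity-check the extraction patterns without a cluster, the regexes from rss.pig can be tried locally. The following is only a rough Python sketch, not part of the Pig workflow: `SAMPLE_ITEMS`, `PATTERNS`, and `extract_row` are names made up for illustration, and it assumes that Python's `re.search` behaves like the find-style matching implied by the results shown above. The doubled backslashes in the Pig script become single backslashes in a raw Python string.

```python
import re

# Two <item> elements copied from the rss.txt sample (indentation dropped).
SAMPLE_ITEMS = [
    """<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>""",
    """<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>""",
]

# Same patterns as in rss.pig, keyed by the output column name.
PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubDate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubDate>",
}

COLUMNS = ("link", "title", "description", "pubdate")


def extract_row(item_xml):
    """Return (link, title, description, pubdate) for one <item> string.

    A field is None when its pattern does not match.
    """
    row = []
    for name in COLUMNS:
        match = re.search(PATTERNS[name], item_xml)
        row.append(match.group(1) if match else None)
    return tuple(row)


if __name__ == "__main__":
    for item in SAMPLE_ITEMS:
        print(extract_row(item))
```

One thing this makes visible: because `.` does not match a newline by default, each pattern only matches when the opening and closing tags sit on the same line. An element whose text wraps across lines would come back as None in this sketch, so real data from an XSD-governed feed may need a proper XML parser rather than regexes.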

Source

2014-03-25 02:34:05 JamCon