2015-12-25

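I want to extract specific information from XML input using Hadoop Pig Latin. The expected output is: (Hadoop definitive guide,Tom white,24.90)

I tried using the REGEX_EXTRACT() function, but had no luck. Can someone help me?

The input to my script is: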

<CATALOG> 
<BOOK> 
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE> 
<AUTHOR>TOM WHITE</AUTHOR> 
<COUNTRY>US</COUNTRY> 
<COMPANY>CLOUDERA</COMPANY> 
<PRICE>24.90</PRICE> 
<YEAR>2012</YEAR> 
</BOOK> 
<BOOK> 
<TITLE>Programming Pig</TITLE> 
<AUTHOR>Alan Gates</AUTHOR> 
<COUNTRY>USA</COUNTRY> 
<COMPANY>Horton Works</COMPANY> 
<PRICE>30.90</PRICE> 
<YEAR>2013</YEAR> 
</BOOK> 
</CATALOG> 

Which version of Pig are you on? I guess RANK is available from Pig 0.9. The script I wrote works perfectly.

Answer

You will have to extract <TITLE>, <AUTHOR>, and <PRICE> separately, then combine them with the JOIN operator.

The following script does this:

-- Load input 
A = LOAD '/input.txt' USING PigStorage() AS (f1:chararray); 

-- Extract <TITLE> 
B1 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<TITLE>(.*)</TITLE>', 1) AS (title:chararray); 
C1 = FILTER B1 BY title is not null; 
D1 = RANK C1; 

-- Extract <AUTHOR> 
B2 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<AUTHOR>(.*)</AUTHOR>', 1) AS (author:chararray); 
C2 = FILTER B2 BY author is not null; 
D2 = RANK C2; 

-- Extract <PRICE> 
B3 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<PRICE>(.*)</PRICE>', 1) AS (price:chararray); 
C3 = FILTER B3 BY price is not null; 
D3 = RANK C3; 

-- Join 3 data sets 
D = JOIN D1 BY $0, D2 BY $0, D3 BY $0; 

-- Eliminate the ranks 
E = FOREACH D GENERATE $1 AS (title:chararray), $3 AS (author:chararray), $5 AS (price:chararray); 

dump E; 

For the input given in the question, I got the following output:

(Hadoop DEFINITIVE GUIDE,TOM WHITE,24.90) 
(Programming Pig,Alan Gates,30.90) 
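
The join lines up because each filtered relation keeps its tags in document order, so the n-th title, n-th author, and n-th price all get rank n. As a rough illustration outside Pig, the same extract, rank, and join steps can be sketched in Python. The helper names (`extract_ranked`, `join_by_rank`) and the inlined sample are made up for this sketch, and it assumes every <BOOK> has exactly one of each tag, each on its own line; it only approximates what REGEX_EXTRACT, FILTER, RANK, and JOIN do.

```python
import re

# Sample input copied from the question (inlined so the sketch is self-contained).
SAMPLE = """<CATALOG>
<BOOK>
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>
<AUTHOR>TOM WHITE</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>"""


def extract_ranked(lines, tag):
    """Roughly mimic REGEX_EXTRACT + FILTER + RANK: map rank -> captured text."""
    pattern = re.compile(r"<%s>(.*)</%s>" % (tag, tag))
    matches = (pattern.search(line) for line in lines)
    # Drop lines that did not match (the "is not null" filter).
    values = [m.group(1) for m in matches if m is not None]
    # Number the survivors 1, 2, 3, ... in input order, like RANK.
    return {rank: value for rank, value in enumerate(values, start=1)}


def join_by_rank(*ranked):
    """Roughly mimic the inner JOIN on rank: keep ranks present in every input."""
    common = set(ranked[0]).intersection(*ranked[1:])
    return [tuple(r[rank] for r in ranked) for rank in sorted(common)]


lines = SAMPLE.splitlines()
books = join_by_rank(
    extract_ranked(lines, "TITLE"),
    extract_ranked(lines, "AUTHOR"),
    extract_ranked(lines, "PRICE"),
)
for book in books:
    print(book)
```

Note the limitation this makes visible: if one book were missing, say, its <PRICE> line, the ranks would drift and later rows would pair values from different books.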

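OK, I was able to extract the individual fields, but I'm unable to join the 3 data sets; I get a parse error: org.apache.pig.tools.grunt.Grunt - ERROR 10000: Error during parsing. Encountered "......" at line 1. I'm also unable to run the RANK command. I had to modify the command above before I could extract the fields, and I still can't join them. What am I doing wrong? Please help. – Mrudula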

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(x, '(.*)', 1)) AS (title:chararray); This is how I extracted the individual fields. – Mrudula

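Which version of Pig are you using? Mine is 0.14, and this script works perfectly for me. The output I posted comes from running the script in my own setup. Can you check 'pig --version'? Your Pig version may not support 'RANK', which is supported from Pig 0.11 onward.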
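
If you want to script that check, here is a minimal Python sketch that compares a version string against the 0.11 threshold mentioned above. The helper names and the sample strings are assumptions for illustration; in particular, the exact text that 'pig --version' prints on your installation may differ, so adjust the pattern if needed.

```python
import re

# Per the comment above: RANK is supported from Pig 0.11 onward.
MIN_RANK_VERSION = (0, 11)


def parse_pig_version(text):
    """Pull (major, minor) out of text such as 'Apache Pig version 0.14.0'."""
    m = re.search(r"version\s+(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError("no version number found in: %r" % text)
    return (int(m.group(1)), int(m.group(2)))


def supports_rank(text):
    """True if the reported version is at least 0.11."""
    # Tuples compare element by element, so (0, 9) < (0, 11) < (0, 14).
    return parse_pig_version(text) >= MIN_RANK_VERSION


# Hypothetical sample strings; paste in the real output of 'pig --version'.
print(supports_rank("Apache Pig version 0.14.0"))
print(supports_rank("Apache Pig version 0.9.2"))
```

Comparing numeric tuples rather than raw strings matters here, because as plain text "0.9" would sort after "0.11".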