2016-08-07

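PIG: How to create a percentage (%) based table?

I want to create a table that shows how often each value occurs, as a percentage. For example, I have a table named example that contains the following data: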

class, value 
------ ------- 
1  , abc 
1  , abc 
1  , xyz 
1  , abc 
2  , xyz 
2  , abc 

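Here, for class value 1, "abc" occurs 3 times and "xyz" occurs only once, out of 4 occurrences in total. For class value 2, "abc" and "xyz" each occur once (2 occurrences in total).

So the output should be: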

class, %_of_abc, %_of_xyz 
------ -------- -------- 
1  , 75  , 25 
2  , 50  , 50 

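Any ideas on how to do this when the values in both columns vary? I was thinking of using GROUP, but I'm not sure how grouping by the class value would help me.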

Answer

It's a bit involved, but here is a solution:

grunt> Dump A; 
(1,abc) 
(1,abc) 
(1,xyz) 
(1,abc) 
(2,xyz) 
(2,abc) 
grunt> B = Group A by class; 
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt; 
grunt> D = Group A by (class,value);   
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt; 
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt; 
grunt> G = JOIN F BY class,C BY class; 
grunt> H = foreach G generate $0 as class,$1 as value,($2*100/$4) as perc; 
grunt> Dump H; 
(1,xyz,25) 
(1,abc,75) 
(2,xyz,50) 
(2,abc,50) 
grunt> I = group H by class; 
grunt> J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc)); 
grunt> Dump J; 
(1,75,25) 
(2,50,50) 
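
One caveat with the last step: bags in Pig are unordered, so the column order that BagToTuple produces in J is not guaranteed to be abc first and xyz second. If the set of values is known in advance, a nested FOREACH with one FILTER per value keeps the columns fixed. The following is only a sketch under that assumption: the file name 'example.csv', the comma delimiter, and the aliases pct_abc and pct_xyz are made up for illustration, and the input is assumed to have no padding spaces around the fields.

```pig
-- Hypothetical input file: one "class,value" pair per line, no extra whitespace.
A = LOAD 'example.csv' USING PigStorage(',') AS (class:int, value:chararray);

-- One group per class; the bag A inside each group holds all rows of that class.
B = GROUP A BY class;

-- For each class, count the rows per known value and divide by the class total.
-- COUNT returns a long, so this is integer division, as in the answer above.
C = FOREACH B {
    abc = FILTER A BY value == 'abc';
    xyz = FILTER A BY value == 'xyz';
    GENERATE group AS class,
             COUNT(abc) * 100 / COUNT(A) AS pct_abc,
             COUNT(xyz) * 100 / COUNT(A) AS pct_xyz;
};

DUMP C;
```

For the sample data this should give 75 and 25 for class 1, and 50 and 50 for class 2. It does not help when the values are not known up front; in that case the join-based approach above is still needed.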

Thanks! Works perfectly! – Tanvir