2015-09-18
1

How to use an array_agg() aggregate function in Pig or Hive

I have the following data:

================================================================ 
session_id      screen_name screen_launch_time 
================================================================ 
990004916946605-1404157897784 screen1  1404157898275 
990004916946605-1404157897784 screen2  1404157898337 
990004947764274-1435162269418 screen1  1435162274044 
990004947764274-1435162269418 screen3  1435162274081 

I want to use an array_agg function to get my data into the following format:

========================================================= 
session_id      screen_flow   count 
========================================================= 
990004916946605-1404157897784 screen1->screen2 1 
990004947764274-1435162269418 screen1->screen3 1 

Has anyone tried writing a Python UDAF script to implement the logic of an array_agg function?

Please share your thoughts.

Hive has the built-in `collect_set()` and `collect_list()`, which aggregate items into an array. Here is a UDF that does the same thing: https://github.com/klout/brickhouse/tree/master/src/main/java/brickhouse/udf/collect – gobrewers14

Hi, it gave me this error: – explorethis

FAILED: ParseException line 1:0 character '' not supported here – explorethis

Answers

3

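Just group by session_id, concatenate the screen_name values, and count the number of records in each group. If you don't want to build the brickhouse jar, you can use collect_list() instead of collect() (though I don't recommend it).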

Query

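-- Register the Brickhouse jar and expose its collect UDAF
add jar /path/to/jars/brickhouse-0.7.1.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";

select session_id, screen_flow
    , count(*) count
from (
    -- One row per session: its screen names joined with '->'.
    -- Note: the aggregation order is not tied to screen_launch_time here.
    select session_id
    , concat_ws('->', collect(screen_name)) screen_flow
    from db.table
    group by session_id) x
group by session_id, screen_flow;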

Output

990004916946605-1404157897784 screen1->screen2 1 
990004947764274-1435162269418 screen1->screen3 1 
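
Since the question asked about doing this in Python, the group / concatenate / count logic described above can be sketched in plain Python. This is only an illustration under stated assumptions, not a Hive UDAF: the function names `screen_flows` and `flow_counts` are hypothetical, and sorting each session's screens by launch time is an added assumption that the Hive query itself does not make.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (session_id, screen_name, screen_launch_time)
rows = [
    ("990004916946605-1404157897784", "screen1", 1404157898275),
    ("990004916946605-1404157897784", "screen2", 1404157898337),
    ("990004947764274-1435162269418", "screen1", 1435162274044),
    ("990004947764274-1435162269418", "screen3", 1435162274081),
]


def screen_flows(rows, sep="->"):
    """Map each session_id to its screen names joined with sep.

    Assumption: screens are ordered by launch time within a session.
    """
    by_session = defaultdict(list)
    for session_id, screen_name, launch_time in rows:
        by_session[session_id].append((launch_time, screen_name))
    return {
        session_id: sep.join(name for _, name in sorted(events))
        for session_id, events in by_session.items()
    }


def flow_counts(rows):
    """Count rows per (session_id, screen_flow), like the outer GROUP BY."""
    return Counter(screen_flows(rows).items())


for (session_id, flow), n in sorted(flow_counts(rows).items()):
    print(session_id, flow, n)
```

Because the inner step already yields one row per session, every (session_id, screen_flow) pair has a count of 1 for this input, which matches the output shown above.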

Nice one, GoBrewers. I also need to do 2 other things: 1. Remove duplicate screens (if any) after grouping by session ID. 2. Rank them by maximum count. Can you help? – explorethis

If my answer solved your original question, please mark it as correct. Then, if you have further questions, ask a new question and I will certainly help if I can. – gobrewers14

Hi GoBrewers, here is the link to the new question: http://stackoverflow.com/questions/32681157/how-to-find-the-pathing-flow-and-rank-them-using-pig-or-hive – explorethis

1

Input:

990004916946605-1404157897784,screen1,1404157898275 
990004916946605-1404157897784,screen2,1404157898337 
990004947764274-1435162269418,screen1,1435162274044 
990004947764274-1435162269418,screen3,1435162274081 

Here is a Pig-style answer:

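-- Load comma-separated rows: session_id, screen_name, screen_launch_time
records = LOAD '/user/user/inputfiles/session_id.txt' USING PigStorage(',') AS (session_id:chararray,screen_name:chararray,screen_launch_time:long);

rec_grped = GROUP records BY session_id;

rec_each = FOREACH rec_grped
        {
         -- Sort each session's screens by launch time before joining them
         rec_sorted = ORDER records BY screen_launch_time;
         rec_inner_each = FOREACH rec_sorted GENERATE screen_name;

         -- Pass the delimiter to BagToString directly (its default is '_')
         GENERATE group AS session_id, BagToString(rec_inner_each, '-->') AS screen_flow, 1 AS cnt;
};

DUMP rec_each;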

Output:

990004916946605-1404157897784 screen1-->screen2 1 
990004947764274-1435162269418 screen1-->screen3 1 
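
One detail worth keeping in mind: Pig's BagToString joins with '_' by default, so joining with the default and then replacing '_' with '-->' would also rewrite underscores inside screen names. The small Python sketch below illustrates that pitfall; the screen names and both function names are hypothetical and only stand in for the two approaches.

```python
# Hypothetical screen names; "home_page" contains the default '_' delimiter.
screens = ["home_page", "checkout"]


def join_then_replace(names, sep="-->"):
    """Join with '_' first, then replace every '_' with sep."""
    return "_".join(names).replace("_", sep)


def join_directly(names, sep="-->"):
    """Join with the wanted separator in a single step."""
    return sep.join(names)


print(join_then_replace(screens))  # home-->page-->checkout
print(join_directly(screens))      # home_page-->checkout
```

With the sample data in this thread the two approaches give the same result, because none of the screen names contain an underscore.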
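
Thanks, Surender. Nice one. I also need to do 2 other things: 1. Remove duplicate screens (if any) after grouping by session ID. 2. Rank them by maximum count. Can you help? – explorethis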

Good. For the duplicate case, just give me the input and the expected output. Also, it would be good if you asked it as a separate question. Thanks. –

Hi Surender, I have posted it as a separate question; that would be good. – explorethis