Finding the most frequent value with Pig

2014-10-31 66 views

I have the following data set:

dump DATA_INPUT; 
    (0000001686601081020,10A) 
    (0000001686601081020,08D) 
    (0000001686601081020,08D) 
    (0000001686601081020,08D) 
    (0000001686601081020,09D) 
    (0000001686601081020,09D) 
    (0000001686601081020,08D) 
    (0000001686601081020,08D) 
    (0000001686601081020,08D) 
    (0000001686676950125,0A1) 
    (0000001686676950125,0A1) 
    (0000001686676950125,0A2) 

Column $0 is ACCOUNT_ID and column $1 is the cell ID.

For each account_id I need to find the most frequent cell ID.

As a first step, I tried this:

grpd = GROUP DATA_INPUT BY ($0, $1);
cells_count = FOREACH grpd GENERATE group, COUNT(DATA_INPUT.$1) AS count;
all_cells_counts = GROUP cells_count BY group.$0;
top_cell = FOREACH all_cells_counts {
    A = ORDER cells_count BY count DESC;
    B = LIMIT A 1;
    GENERATE FLATTEN(B.group);
}

The result I get:

    ((0000001686601081020,08D))
    ((0000001686676950125,0A1))

How can I get rid of the extra parentheses, so that the result is:

    (0000001686601081020,08D)
    (0000001686676950125,0A1)

Answers

1

Apply FLATTEN to top_cell:

final_result = FOREACH top_cell GENERATE FLATTEN($0); 
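
The extra parentheses appear because grouping by two columns makes `group` a tuple, and that tuple is carried through as a single field. As a possible alternative, here is a sketch (not run against your data) that flattens the tuple at the counting step, so no second pass is needed. `DATA_INPUT` is the relation from the question; the aliases `account_id`, `cell_id`, `cnt`, `by_account`, `by_count` and `best` are names made up for this example.

```pig
-- Count each (account, cell) pair. FLATTEN(group) unpacks the grouping
-- tuple into two top-level fields right away.
grpd        = GROUP DATA_INPUT BY ($0, $1);
cells_count = FOREACH grpd GENERATE
                  FLATTEN(group) AS (account_id, cell_id),
                  COUNT(DATA_INPUT) AS cnt;

-- Regroup by account and keep the cell with the highest count.
by_account  = GROUP cells_count BY account_id;
top_cell    = FOREACH by_account {
                  by_count = ORDER cells_count BY cnt DESC;
                  best     = LIMIT by_count 1;
                  GENERATE group AS account_id, FLATTEN(best.cell_id) AS cell_id;
              };

DUMP top_cell;
```

Here `group` is a plain value rather than a tuple in the second GROUP, because the grouping key is a single field. If two cells tie for the highest count within an account, LIMIT 1 keeps only one of them, and which one is not guaranteed.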

Thank you very much! That is what I tried to do in many ways, without success :) – Marta 2014-10-31 17:33:16