2016-01-26 54 views

I have the following data in HDFS. How can I find the number of repeat users in Pig?

user_id, category_id 
1, 12344 
1, 12344 
1, 12345 
2, 12345 
2, 12345 
3, 12344 
3, 12344 

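And so on. I want to find the number of repeat users for each category, where a repeat user is one who appears more than once in that category.

For the data above, the expected result is: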

12344, 2 (because user_id 1 and 3 are repeated users) 
12345, 1 (user_id 2 is repeated user.. 1 is not as that user visited just once) 

How can I do this in Pig?

Answer


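First keep only the repeat users (the (user, category) pairs that occur more than once), then group those by category and count them. Try the code below.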

Input:

1,12344 
1,12344 
1,12345 
2,12345 
2,12345 
3,12344 
3,12344 

Pig script:

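-- Load the comma-separated (user id, category id) pairs.
records = LOAD '/home/inputfiles/repeats.txt' USING PigStorage(',') AS (id:int, category:int);

-- One group per distinct (user, category) pair.
records_grp = GROUP records BY (id, category);

-- Flag a pair as repeated when it occurs more than once.
records_each = FOREACH records_grp GENERATE FLATTEN(group) AS (id, category), (COUNT(records.id) > 1 ? 'Y' : 'N') AS repeat_ind;

-- Keep only the repeated pairs.
records_filter = FILTER records_each BY repeat_ind == 'Y';

-- Count the repeated users within each category.
rec_grp = GROUP records_filter BY category;

rec_each = FOREACH rec_grp GENERATE group AS category, COUNT(records_filter) AS cnt_of_repeated_users;

DUMP rec_each;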

Output:

(12344,2) 
(12345,1)
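
A slightly more compact variant of the same idea, offered as a sketch rather than a tested replacement: instead of deriving a Y/N flag, keep the per-(user, category) count and filter on it directly. The relation names (pairs, visits, repeats, by_cat, result) and the field name n are illustrative choices, and the input path is assumed to be the same one used in the script above.

```pig
-- Assumption: same comma-separated input file, with no header row.
records = LOAD '/home/inputfiles/repeats.txt' USING PigStorage(',') AS (id:int, category:int);

-- Count how many times each (user, category) pair occurs.
pairs = GROUP records BY (id, category);
visits = FOREACH pairs GENERATE FLATTEN(group) AS (id, category), COUNT(records) AS n;

-- A user is a repeat user in a category when the pair occurs more than once.
repeats = FILTER visits BY n > 1;

-- Count the repeat users per category.
by_cat = GROUP repeats BY category;
result = FOREACH by_cat GENERATE group AS category, COUNT(repeats) AS cnt_of_repeated_users;

DUMP result;
```

Filtering on the numeric count avoids the intermediate string column, and changing the threshold (for example, n > 2 for users seen at least three times) becomes a one-character edit.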