2016-04-19

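I have a data file like the one below, which marks each order as valid or invalid. I want to count the valid orders and the invalid orders. How can I count the distinct values in a column in Pig and Hive?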

1,flipkart,pepsi,invalid 
2,flipkart,tshirt,valid 
3,flipkart,shirt,valid 
4,amazon,shoe,valid 
5,amazon,beer,invalid 
6,flipkart,jewels,valid 
7,flipkart,coke,invalid 

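So the final output should look like this:

  1. The total number of valid and invalid records.

    e.g. valid: 4, invalid: 3

  2. For flipkart, how many records are valid and invalid, and the same for amazon.

    e.g. flipkart: valid 3, invalid 2; amazon: valid 1, invalid 1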

Where is your Pig script?

Answers

In Pig, use GROUP BY followed by FOREACH.

Assuming the columns are named id, name, pp, state:

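-- Field names are case-sensitive in Pig, so they must match the schema (name, state).
byNameState = GROUP my_data BY (name, state); 
-- Emit the group key alongside the count; otherwise the counts cannot be told apart.
byNameStateCounts = FOREACH byNameState GENERATE 
    FLATTEN(group) AS (name, state), 
    COUNT(my_data) AS ccc; 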

You can try the script below to get your expected output.

Answer to question 1:

a = load '/home/abhijit/Downloads/movies.txt' USING PigStorage(',') AS (id:int, companyName:chararray, item:chararray, state:chararray); 

Dump a; 

(1,flipkart,pepsi,invalid) 
(2,flipkart,tshirt,valid) 
(3,flipkart,shirt,valid) 
(4,amazon,shoe,valid) 
(5,amazon,beer,invalid) 
(6,flipkart,jewels,valid) 
(7,flipkart,coke,invalid) 

grp = group a by state; 
dump grp; 

(valid,{(2,flipkart,tshirt,valid),(3,flipkart,shirt,valid),(4,amazon,shoe,valid),(6,flipkart,jewels,valid)}) 
(invalid,{(1,flipkart,pepsi,invalid),(5,amazon,beer,invalid),(7,flipkart,coke,invalid)}) 

cnt = foreach grp generate $0, COUNT($1); 
dump cnt; 

(valid,4) 
(invalid,3) 
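
The same count can also be written with named fields instead of the positional references $0 and $1, which some readers find easier to follow. This is only a sketch: it assumes the relation a was loaded with the schema shown earlier, and the aliases cnt_named and total are made up for illustration.

```pig
-- Assumes relation `a` was loaded with the schema shown earlier.
grp = GROUP a BY state;

-- After GROUP, the key is exposed as `group` and the bag takes the
-- name of the grouped relation (`a` here).
-- `cnt_named` and `total` are illustrative names, not from the original answer.
cnt_named = FOREACH grp GENERATE group AS state, COUNT(a) AS total;

DUMP cnt_named;
```

Note that COUNT skips tuples whose first field is null; if the id column could be empty, COUNT_STAR counts those rows as well.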

Answer to question 2:

grp2 = group a by (companyName,state); 
dump grp2; 

((amazon,valid),{(4,amazon,shoe,valid)}) 
((amazon,invalid),{(5,amazon,beer,invalid)}) 
((flipkart,valid),{(2,flipkart,tshirt,valid),(3,flipkart,shirt,valid),(6,flipkart,jewels,valid)}) 
((flipkart,invalid),{(1,flipkart,pepsi,invalid),(7,flipkart,coke,invalid)}) 


cnt2 = foreach grp2 generate $0, COUNT($1); 
dump cnt2; 

((amazon,valid),1) 
((amazon,invalid),1) 
((flipkart,valid),3) 
((flipkart,invalid),2)
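
If one row per company is preferred, closer to the "flipkart: valid 3, invalid 2" layout in the question, a nested FOREACH can filter the bag by state before counting. This is a sketch under the same assumptions as above (relation a with the schema shown earlier); the aliases by_company, valid_rows, invalid_rows, valid_count and invalid_count are hypothetical.

```pig
-- Group once per company; each group carries a bag of that company's orders.
by_company = GROUP a BY companyName;

-- Inside the nested block, split each bag by state and count both parts.
summary = FOREACH by_company {
    valid_rows   = FILTER a BY state == 'valid';
    invalid_rows = FILTER a BY state == 'invalid';
    GENERATE group AS companyName,
             COUNT(valid_rows)   AS valid_count,
             COUNT(invalid_rows) AS invalid_count;
};

DUMP summary;
```

With the sample data this should give one tuple for amazon (1 valid, 1 invalid) and one for flipkart (3 valid, 2 invalid); the order of the tuples is not guaranteed.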