2012-09-04

How do I generate a certain number of tuples in Pig?

I have the following datasets:

A:

x1 y z1 
x2 y z2 
x3 y z3 
x43 y z33 
x4 y2 z4 
x5 y2 z5 
x6 y2 z6 
x7 y2 z7 

B:

y 12 
y2 25 

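Load A:
A = LOAD '$input' USING PigStorage() AS (k:chararray, m:chararray, n:chararray);
Load B:
B = LOAD '$input2' USING PigStorage() AS (o:chararray, p:int);

I join A and B on m and o. What I want is to keep only x tuples for each o. So, for example, if x is 2 the result is: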

x1 y z1 
x2 y z2 
x4 y2 z4 
x5 y2 z5 

Answer

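To do this you need GROUP BY, a FOREACH with a nested LIMIT, and then a JOIN or COGROUP. Here is an implementation for Pig 0.10; I ran it on your input data and got the specified output: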

A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray); 
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int); 
-- since the join is on m, keep only 2 rows per value of m.
group_A = group A by m; 
top_A_x = foreach group_A { 
    top = limit A 2; -- where x = 2 
    generate flatten(top); 
}; 

-- another way to do join, allows us to do left or right joins and checks 
co_join = cogroup top_A_x by (m), B by (o); 
-- filter out records from A that are not in B 
filter_join = filter co_join by IsEmpty(B) == false; 
result = foreach filter_join generate flatten(top_A_x); 

Or you can do it with just a COGROUP and a FOREACH with a nested LIMIT:

A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray); 
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int); 

co_join = cogroup A by (m), B by (o); 
filter_join = filter co_join by IsEmpty(B) == false; 
result = foreach filter_join { 
    top = limit A 2; 
    -- you can limit B as well
    generate flatten(top); 
};
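
One caveat with both versions: a nested LIMIT with no ORDER does not promise which tuples of each group are kept, so the rows you get back may differ from run to run. If you need the same x rows every time, a minimal sketch is to sort inside the FOREACH before limiting. This assumes k is the column you want to order by and that x is passed in as a parameter (the parameter name x and the command line in the comment are illustrative, not from the question):

```pig
-- Assumed invocation: pig -param x=2 script.pig   ('x' is a hypothetical parameter name)
A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray);
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int);

co_join = cogroup A by (m), B by (o);
-- keep only keys that also appear in B
filter_join = filter co_join by not IsEmpty(B);
result = foreach filter_join {
    sorted = order A by k;   -- fix the order inside each group before limiting
    top = limit sorted $x;   -- keep the first x tuples of the sorted bag
    generate flatten(top);
};
dump result;
```

Note that k is a chararray, so the ordering is lexicographic rather than numeric; pick a different sort column if that is not what you want.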