2015-10-19 48 views
1

I want to sort the tuples inside a bag in descending order based on three fields, i.e. sort tuples on multiple fields.

Example: suppose I have the following bag, created by a GROUP:

{(s,3,my),(w,7,pr),(q,2,je)} 

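I want to sort the tuples in the grouped bag above on fields $0, $1 and $2. It should first sort all the tuples on $0 and pick the tuple with the largest $0 value. If $0 is the same for all tuples, it should sort on $1, and so on.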

The sort should be applied to every grouped bag in turn.

Suppose we have a databag like this:

{(21,25,34),(21,28,64),(21,25,52)} 

Then the required output should be:

{(21,25,34),(21,25,52),(21,28,64)} 

Please let me know if you need any more clarification.

+0

So what should your output look like? –

+0

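The required output for the bag above is {(q,2,je),(s,3,my),(w,7,pr)}. But suppose we have a databag like {(21,25,34),(21,28,64),(21,25,52)}; then the required output should be {(21,25,34),(21,25,52),(21,28,64)}. Please let me know if you need more clarification. –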

+0

Added the expected output from the comment to the question. –

Answers

1

Order your tuples inside a nested FOREACH. This will work.

Input:

(1,s,3,my) 
(1,w,7,pr) 
(1,q,2,je) 


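A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;                 -- one bag of tuples per value of a
C = FOREACH B GENERATE A;         -- keep only the bag
D = FOREACH C {
    od = ORDER A BY b, c, d;      -- sort inside each bag: b first, then c, then d
    GENERATE od;
};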

Result of DUMP C (this is similar to your data):

({(1,s,3,my),(1,w,7,pr),(1,q,2,je)}) 

Output:

({(1,q,2,je),(1,s,3,my),(1,w,7,pr)}) 

This will work for all the cases.
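As a rough illustration of what the nested ORDER does, here is a minimal sketch in plain Python (not Pig). It groups the sample rows by the first field and sorts each bag on the remaining three fields. The function name `sort_bags` is hypothetical. Note that the fields are declared as chararray in the script, so comparisons are lexicographic; the last line shows why that matters for numbers of different widths, and declaring numeric fields as int in the schema would presumably be needed for numeric order.

```python
from itertools import groupby

# Rows mirror the sample input: (a, b, c, d), all held as strings (chararray).
rows = [("1", "s", "3", "my"), ("1", "w", "7", "pr"), ("1", "q", "2", "je")]

def sort_bags(rows):
    """Group rows by field a, then sort each group's bag by (b, c, d)."""
    keyed = sorted(rows, key=lambda r: r[0])  # groupby needs input sorted by key
    return {
        a: sorted(bag, key=lambda r: (r[1], r[2], r[3]))
        for a, bag in groupby(keyed, key=lambda r: r[0])
    }

bags = sort_bags(rows)
print(bags["1"])

# String comparison is lexicographic: "21" sorts before "7".
print(sorted(["7", "21"]))
```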

To generate the tuple with the highest value:

A = LOAD 'file' USING PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray);
B = GROUP A BY a;
C = FOREACH B GENERATE A;
D = FOREACH C {
    od = ORDER A BY b DESC, c DESC, d DESC;   -- largest (b, c, d) first
    od1 = LIMIT od 1;                         -- keep only the top tuple of each bag
    GENERATE od1;
};
DUMP D;
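The ORDER ... DESC plus LIMIT 1 pattern can be pictured with a small plain-Python sketch (not Pig). It assumes the numeric fields are held as integers rather than strings, and `top_tuple` is a hypothetical helper name.

```python
def top_tuple(bag):
    """Return a bag holding the tuple that sorts first under descending (b, c, d)."""
    ordered = sorted(bag, key=lambda r: (r[1], r[2], r[3]), reverse=True)
    return ordered[:1]  # like LIMIT 1: at most one tuple survives

# Tuples are (a, b, c, d); the values come from the question's second example.
bag = [(1, 21, 25, 34), (1, 21, 28, 64), (1, 21, 25, 52)]
print(top_tuple(bag))
```

With this rule a bag of identical tuples still yields a single tuple, because the limit is applied after sorting.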

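To generate the tuple with the highest value when all three fields differ, and to return all the tuples when all the records are identical or when field 1 and field 2 are the same: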

A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:chararray,c:chararray,d:chararray); 
B = GROUP A BY a;                        
C = FOREACH B GENERATE A; 
F = RANK C; -- rank is used to tell rows apart when two tuples are the same
R = FOREACH F {
    dis = DISTINCT A;
    GENERATE rank_C, COUNT(dis) AS (cnt:long), A;
};
R3 = FILTER R BY cnt != 1; -- drop the bags in which all the tuples are the same
R4 = FOREACH R3 {
    fil1 = ORDER A BY b DESC, c DESC, d DESC;
    fil2 = LIMIT fil1 1;
    GENERATE rank_C, fil2;
}; -- find the largest tuple, except when all the tuples are the same
R5 = FILTER R BY cnt == 1; -- only the bags in which all the tuples are the same
R6 = FOREACH R5 GENERATE A; -- generate the required fields
F1 = FOREACH F GENERATE rank_C, FLATTEN(A);
F2 = GROUP F1 BY (rank_C, A::b, A::c); -- group by field 1 and field 2
F3 = FOREACH F2 GENERATE COUNT(F1) AS (cnt1:long), F1; -- if the count is 2, the tuples match on field 1 and field 2
F4 = FILTER F3 BY cnt1 == 2; -- separate those out
F5 = FOREACH F4 {
    DIS = DISTINCT F1;
    GENERATE FLATTEN(DIS);
};
F8 = JOIN F BY rank_C, F5 BY rank_C;
F9 = FOREACH F8 GENERATE F::A;
Z = CROSS R4, F5; -- the cross is used to generate the rows where all the tuples are different
Z1 = FILTER Z BY R4::rank_C != F5::DIS::rank_C;
Z2 = FOREACH Z1 GENERATE FLATTEN(R4::fil2);
res = UNION Z2, R6, F9; -- Z2: the highest tuple when all three fields of the tuples differ
-- R6: the tuples when all three fields are the same
-- F9: the tuples when two fields of the tuples are the same
DUMP res;
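For comparison, one possible reading of the selection rule can be written as a short plain-Python sketch (not Pig, and not a translation of the script above): keep every tuple whose (b, c, d) ties for the largest (b, c, d) in the bag. Under that assumption a bag of identical tuples comes back whole. The name `max_with_ties` is hypothetical, and the fields are assumed to be integers.

```python
def max_with_ties(bag):
    """Keep every tuple whose (b, c, d) equals the largest (b, c, d) in the bag."""
    if not bag:
        return []
    best = max((r[1], r[2], r[3]) for r in bag)
    return [r for r in bag if (r[1], r[2], r[3]) == best]

# Tuples are (a, b, c, d).
mixed = [(1, 21, 25, 34), (1, 21, 25, 52), (1, 21, 28, 64)]
same = [(1, 5, 5, 5), (1, 5, 5, 5)]
print(max_with_ties(mixed))
print(max_with_ties(same))
```

Tuple comparison in Python is already lexicographic, so the field-by-field tie-breaking (first b, then c, then d) needs no extra code here.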
+0

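Thanks for your help. I have one more requirement. For the example above, I need to find the tuple with the highest value on $0. If $0 is the same for all the tuples, take the tuple with the highest value on $1. So for the databag {(21,25,34),(21,25,52),(21,28,64)} the output is {(21,28,64)}. If all of the $0, $1 and $2 fields are the same, it should return all the tuples. –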

+0

Edited the answer. One thing to note: if all three fields are the same, you will get only one tuple. Please accept the answer if the post resolves your query. –

+0

Thanks a ton, Vignesh –