2012-07-09 34 views

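Performing counts on distinct values in Pig

I have a problem in Pig that looks like it needs two levels of grouping. As an example, suppose I have some sample input data: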

email_id:chararray from:chararray  to:bag{recipients:tuple(recipient:chararray)} 
e1     [email protected]  {([email protected]),([email protected]),([email protected])} 
e2     [email protected]  {([email protected]),([email protected])} 
e3     [email protected]  {([email protected])} 
e4     [email protected]  {([email protected]),([email protected])} 

So each row is an email sent from the "from" user to the "to" user(s).

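What I ultimately want is a list of all senders and all the people they have emailed, including the number of emails sent to each person, sorted from highest to lowest count, for example: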

[email protected]  {([email protected], 2), ([email protected], 1), ([email protected], 1), ([email protected], 1), ([email protected], 1)} 
[email protected]  {([email protected], 1), ([email protected], 1)} 

Any suggestions on the best way to approach this in Pig would be appreciated!

Answer


Here is one version of the script:

inpt = load '/pig_data/pig_fun/input/from_senders.txt' as (email_id:chararray, from:chararray, to:bag{recipients:tuple(recipient:chararray)}); 

pivot = foreach inpt generate from, FLATTEN(to); 
pivot = foreach pivot generate from, to::recipient as recipient; 
dump pivot; 
/* 
([email protected],[email protected]) 
([email protected],[email protected]) 
([email protected],[email protected]) 
([email protected],[email protected]) 
([email protected],[email protected]) 
([email protected],[email protected]) 
([email protected],[email protected]) 
([email protected],[email protected]) 
*/ 

grp = group pivot by (from, recipient); 
with_count = foreach grp generate FLATTEN(group), COUNT(pivot) as count; 
dump with_count; 
/* 
([email protected],[email protected],2) 
([email protected],[email protected],1) 
([email protected],[email protected],1) 
([email protected],[email protected],1) 
([email protected],[email protected],1) 
([email protected],[email protected],1) 
([email protected],[email protected],1) 
*/ 

to_bag = group with_count by from; 
result = foreach to_bag { 
    order_by_count = order with_count by count desc; 
    generate group as from, order_by_count.(recipient, count); 
}; 
dump result; 
/* 
([email protected],{([email protected],2),([email protected],1),([email protected],1),([email protected],1),([email protected],1)}) 
([email protected],{([email protected],1),([email protected],1)}) 
*/ 
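
For anyone who wants to cross-check the logic without a Pig installation, the same flatten, group, count, order pipeline can be sketched in plain Python. This is only an illustration under assumptions: the addresses are made-up placeholders (not the original data), and `recipient_counts` is a hypothetical helper name, not anything provided by Pig.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows: (email_id, sender, recipients).
# The addresses are placeholders, not the original data.
emails = [
    ("e1", "alice@example.com", ["bob@example.com", "carol@example.com", "dave@example.com"]),
    ("e2", "alice@example.com", ["bob@example.com", "erin@example.com"]),
    ("e3", "alice@example.com", ["frank@example.com"]),
    ("e4", "bob@example.com", ["alice@example.com", "carol@example.com"]),
]


def recipient_counts(rows):
    """Return {sender: [(recipient, count), ...]} sorted by count, highest first."""
    # Step 1 (FLATTEN in the Pig script): one (sender, recipient) pair per recipient.
    pairs = [(sender, rcpt) for _, sender, rcpts in rows for rcpt in rcpts]
    # Step 2 (GROUP BY (from, recipient) + COUNT): count each distinct pair.
    pair_counts = Counter(pairs)
    # Step 3 (GROUP BY from): collect (recipient, count) tuples per sender.
    by_sender = defaultdict(list)
    for (sender, rcpt), n in pair_counts.items():
        by_sender[sender].append((rcpt, n))
    # Step 4 (nested ORDER ... DESC): sort each sender's list by count.
    return {
        sender: sorted(counts, key=lambda rc: rc[1], reverse=True)
        for sender, counts in by_sender.items()
    }


result = recipient_counts(emails)
for sender, counts in result.items():
    print(sender, counts)
```

The two grouping levels in the Pig script correspond to steps 2 and 3 here: first count per (sender, recipient) pair, then regroup those counts by sender.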

Hope this helps.


Thanks alexeipab! This is a very clear solution. – Matan 2012-07-09 17:02:46