2013-12-17 91 views
2

Count distinct elements

Suppose I have an alias transactions with this data:

person store spent 
A  S  3.3 
A  S  4.7 
B  S  1.2 
B  T  3.4 

I want to find out how many distinct people visited each store and how much they spent there:

store visitors revenue 
S  2   9.2 
T  1   3.4 

I was hoping I could do it in one step:

stores = foreach (group transactions by store) generate 
    group as store, SUM(transactions.spent) as revenue, 
    COUNT(UNIQUE(transactions.person)) as visitors; 

But there doesn't seem to be any such thing as UNIQUE.

Am I stuck with two steps?

tr1 = foreach (group transactions by (store,person)) generate 
    group.store as store, SUM(transactions.spent) as revenue; 
stores = foreach (group tr1 by store) generate 
    group as store, COUNT(tr1) as visitors, SUM(tr1.revenue) as revenue; 

Answers

4

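There are two approaches:

1) Use the Distinct built-in UDF (not the DISTINCT Pig operator). Sorry, I don't have a code example, and I don't know how it will perform.

2) Use a nested foreach with the DISTINCT operator, something like this: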

stores = FOREACH (GROUP transactions BY store) { 
    -- de-duplicate the person column within each store's bag
    uniqueVisitors = DISTINCT transactions.person; 
    GENERATE 
     group AS store, 
     COUNT(uniqueVisitors) AS visitors, 
     SUM(transactions.spent) AS revenue; 
} 

One nice thing about the second approach is that it should not disable the combiner: http://pig.apache.org/docs/r0.11.1/perf.html#When+the+Combiner+is+Used
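For anyone who wants to try the nested-foreach approach end to end, here is a minimal sketch. It assumes the rows from the question sit in a tab-separated file called transactions.tsv (a hypothetical name) and that the columns are typed as shown. The EXPLAIN at the end is one way to check the combiner claim on your own Pig version: if the combiner is in use, the MapReduce plan it prints should include a combine stage. I have not verified the plan output myself.

```pig
-- Assumption: transactions.tsv is a hypothetical tab-separated file
-- containing the person/store/spent rows from the question.
transactions = LOAD 'transactions.tsv' USING PigStorage('\t')
    AS (person:chararray, store:chararray, spent:double);

stores = FOREACH (GROUP transactions BY store) {
    -- project the person column, then de-duplicate it per store
    people = transactions.person;
    uniqueVisitors = DISTINCT people;
    GENERATE
        group AS store,
        COUNT(uniqueVisitors) AS visitors,
        SUM(transactions.spent) AS revenue;
}

-- Print the execution plans so the presence of a combine stage can be checked.
EXPLAIN stores;

-- Print the per-store result.
DUMP stores;
```

Note that COUNT skips null values, so rows with a null person would not be counted as a visitor.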

3

To use the Distinct built-in UDF, just replace your UNIQUE with org.apache.pig.builtin.Distinct:

stores = foreach (group transactions by store) generate 
    group as store, SUM(transactions.spent) as revenue, 
    COUNT(org.apache.pig.builtin.Distinct(transactions.person)) as visitors; 
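If the fully qualified name feels too noisy, a DEFINE alias should let the query read almost like the one in the question. This is only a sketch: DistinctBag is a made-up alias, transactions.tsv is a hypothetical input file, and I am assuming the long name is needed because a bare Distinct would be read as the DISTINCT operator keyword.

```pig
-- Assumption: transactions.tsv is a hypothetical tab-separated file
-- containing the person/store/spent rows from the question.
transactions = LOAD 'transactions.tsv' USING PigStorage('\t')
    AS (person:chararray, store:chararray, spent:double);

-- DistinctBag is a hypothetical alias for the built-in Distinct UDF.
DEFINE DistinctBag org.apache.pig.builtin.Distinct();

stores = FOREACH (GROUP transactions BY store) GENERATE
    group AS store,
    SUM(transactions.spent) AS revenue,
    -- Distinct returns a bag of unique persons; COUNT gives its size
    COUNT(DistinctBag(transactions.person)) AS visitors;

DUMP stores;
```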
+0

Why isn't this UDF listed on the builtins [page](https://pig.apache.org/docs/r0.14.0/func.html#count)? – Eyal

+1

It is here: https://pig.apache.org/docs/r0.15.0/api/org/apache/pig/builtin/Distinct.html – dranxo