2013-10-03

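Access log Pig job: count and distinct count

I have terabytes of access logs that I am trying to analyze for historical reference. I want the number of unique visitors (IPs) per day and the number of hits per day.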

If this were in MySQL, I would just do something like this:

SELECT COUNT(1) from access_logs union 
SELECT COUNT(DISTINCT(ip)) from access_logs 

Is there a way to do something similar in a single job, so that I don't have to run the map job a second time?

I know this won't work, but this is the functionality I'm looking for:

X = (fqdn:chararray,ip:chararray,date:chararray,time:chararray,uri:chararray,ua:chararray); 
Y = COUNT(X); 
Z = COUNT(DISTINCT(X.IP); 
OUT = UNION Y,Z; 
STORE OUT into ... 

Answer


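If you want to count all records in the input, you need to use GROUP ALL, which puts every record into a single bag. Also, for performance reasons, use the accumulator-based DISTINCT function org.apache.pig.builtin.Distinct: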

X = load 'path' as (fqdn:chararray,ip:chararray,date:chararray,time:chararray,uri:chararray,ua:chararray);
IPs = FOREACH X GENERATE ip; -- project early for performance reasons
GRP = group IPs all;
OUT = foreach GRP generate COUNT(IPs) as all_cnt, COUNT(org.apache.pig.builtin.Distinct(IPs.ip)) as distinct_cnt;
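
To make clear what the two output columns mean, here is a minimal sketch in plain Python (not Pig) that computes the same pair of numbers in one pass over a small sample. It assumes tab-separated lines with the fields in the order given in the schema above; the function name and the sample data are made up for illustration. It can serve as a reference to check the Pig output against on a small extract of the logs.

```python
def count_hits_and_unique_ips(lines):
    """Return (all_cnt, distinct_cnt) from a single pass over log lines.

    Assumes tab-separated fields: fqdn, ip, date, time, uri, ua.
    """
    all_cnt = 0
    seen_ips = set()  # every distinct IP is kept in memory
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip lines without an ip field (illustrative choice)
        all_cnt += 1
        seen_ips.add(fields[1])  # ip is the second field
    return all_cnt, len(seen_ips)


# Hypothetical sample: three hits coming from two different IPs.
SAMPLE = [
    "www.example.com\t10.0.0.1\t2013-10-01\t00:00:01\t/\tcurl",
    "www.example.com\t10.0.0.2\t2013-10-01\t00:00:02\t/a\tcurl",
    "www.example.com\t10.0.0.1\t2013-10-02\t00:00:03\t/b\tcurl",
]

if __name__ == "__main__":
    print(count_hits_and_unique_ips(SAMPLE))
```

Like the single-bag approach, this keeps every distinct IP in memory at once, which is where memory pressure can come from on large inputs.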

If you have too many IPs and run into memory-related exceptions, then you can do something like this instead:

X = load 'path' as (fqdn:chararray,ip:chararray,date:chararray,time:chararray,uri:chararray,ua:chararray);
IPs = FOREACH X GENERATE ip; -- project early for performance reasons
Dist_IPs = distinct IPs;
GRP_DIST = group Dist_IPs all;
-- inside the FOREACH, the bag is named after the grouped relation (Dist_IPs)
DIST = foreach GRP_DIST generate COUNT(Dist_IPs) as cnt, 'dist' as category;

GRP_ALL = group IPs all;
-- ALL is a reserved word in Pig Latin, so the alias is named TOTAL here
TOTAL = foreach GRP_ALL generate COUNT(IPs) as cnt, 'all' as category;

OUT = union DIST, TOTAL;
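
The question asks for per-day figures, whereas the scripts above count across the whole input. As a rough illustration of the per-day result, the following plain Python sketch (again not Pig, and only suitable for a small sample) groups by the date field and reports hits and unique IPs for each day. The tab-separated layout, the function name and the sample rows are assumptions made for this example.

```python
from collections import defaultdict


def daily_hits_and_unique_ips(lines):
    """Map each date to (hits, unique_ips) in a single pass.

    Assumes tab-separated fields: fqdn, ip, date, time, uri, ua.
    """
    hits = defaultdict(int)  # date -> number of log lines
    ips = defaultdict(set)   # date -> distinct IPs seen on that date
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines without ip and date (illustrative choice)
        ip, day = fields[1], fields[2]
        hits[day] += 1
        ips[day].add(ip)
    return {day: (hits[day], len(ips[day])) for day in hits}


# Hypothetical sample: two days, with a repeat visitor on the first day.
DAILY_SAMPLE = [
    "www.example.com\t10.0.0.1\t2013-10-01\t00:00:01\t/\tcurl",
    "www.example.com\t10.0.0.1\t2013-10-01\t00:00:02\t/a\tcurl",
    "www.example.com\t10.0.0.1\t2013-10-02\t00:00:03\t/b\tcurl",
    "www.example.com\t10.0.0.2\t2013-10-02\t00:00:04\t/c\tcurl",
]

if __name__ == "__main__":
    for day, (hit_cnt, ip_cnt) in sorted(daily_hits_and_unique_ips(DAILY_SAMPLE).items()):
        print(day, hit_cnt, ip_cnt)
```

In Pig, the corresponding change would presumably be to group by the date field instead of using GROUP ALL, so that each day gets its own bag.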