Log analysis with Apache Pig

I have a log with lines like this:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

The first column (in24.inetnebr.com) is the host, the second (01/Aug/1995:00:00:01 -0400) is the timestamp, and the third (GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0) is the downloaded page.
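For reference, a minimal sketch of how such a line could be split into these three fields in Pig. The path 'access_log' and the relation and field names are assumptions for illustration, and the pattern is written only against the sample line above.

```pig
-- Assumption: 'access_log' is a hypothetical path to the raw log file.
lines = LOAD 'access_log' USING TextLoader() AS (line:chararray);

-- Capture the host, the timestamp without its UTC offset, and the quoted request.
parsed = FOREACH lines GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line,
        '^(\\S+) \\S+ \\S+ \\[([^ ]+) [^\\]]+\\] "([^"]*)".*$'))
    AS (host:chararray, xdate:chararray, request:chararray);

-- Lines that do not match the pattern yield nulls; drop them.
valid = FILTER parsed BY host IS NOT NULL;
```

REGEX_EXTRACT_ALL has to match the whole line, which is why the pattern ends in `.*$` to absorb the status code and byte count.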

How can I find the last two downloaded pages for each host with Pig?

Thanks a lot for your help!


2013-12-09 alfayadd

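I have made a little progress: I now have rows of the form (host, date, address), with date cast to a datetime. From this, how can I select the last two addresses for each host? Thanks in advance. – alfayadd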

I have solved the problem. For reference:

REGISTER piggybank.jar;
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

raw = LOAD 'nasa' USING org.apache.hcatalog.pig.HCatLoader();

-- Cast the columns to chararray so that string functions can be used.
rawCasted = FOREACH raw GENERATE (chararray)host AS host, (chararray)xdate AS xdate, (chararray)address AS address;

-- Cut the date out of the timestamp field and keep only the columns that are used.
rawParsed = FOREACH rawCasted GENERATE host, SUBSTRING(xdate, 1, 20) AS xdate, address;

-- Omit incomplete rows (those without a date).
rawFiltered = FILTER rawParsed BY xdate IS NOT NULL;

-- Convert the timestamp string to a datetime.
analysisTable = FOREACH rawFiltered GENERATE host, ToDate(xdate, 'dd/MMM/yyyy:HH:mm:ss') AS xdate, address;

aTgrouped = GROUP analysisTable BY host;

resultsB = FOREACH aTgrouped {
    -- Keep the two most recent requests of this host.
    elems = ORDER analysisTable BY xdate DESC;
    two = LIMIT elems 2;

    -- The most recent page.
    fstB = ORDER two BY xdate DESC;
    fst = LIMIT fstB 1;

    -- The page before it.
    sndB = ORDER two BY xdate ASC;
    snd = LIMIT sndB 1;

    -- Put both pages into one row per host.
    GENERATE FLATTEN(group), fst.address, snd.address;
};
DUMP resultsB;
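If one row per (host, page) pair is acceptable instead of both pages in a single row, the second pair of ORDER/LIMIT steps can probably be dropped. The following is a sketch under assumptions, not a tested replacement: 'nasa_parsed' is a hypothetical tab-separated file of (host, timestamp, address) rows with timestamps such as 01/Aug/1995:00:00:01, and the relation names are made up for illustration.

```pig
-- Assumption: 'nasa_parsed' is a hypothetical tab-separated input.
rows = LOAD 'nasa_parsed' AS (host:chararray, xdate:chararray, address:chararray);

-- Convert the timestamp string to a datetime so that it sorts chronologically.
visits = FOREACH rows GENERATE host, ToDate(xdate, 'dd/MMM/yyyy:HH:mm:ss') AS ts, address;

byHost = GROUP visits BY host;

lastTwo = FOREACH byHost {
    -- Sort each host's requests newest first and keep at most two.
    newestFirst = ORDER visits BY ts DESC;
    top2 = LIMIT newestFirst 2;
    -- Emit one output row per kept address.
    GENERATE group AS host, FLATTEN(top2.address) AS address;
};
DUMP lastTwo;
```

Unlike the script above, a host with a single request produces one row here rather than the same page listed twice.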


2013-12-17 19:46:31 alfayadd

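I ran four analyses on this NASA dataset (two with Pig, two with Hive). If anyone is interested, I can provide a link to the dataset and the code for the other three analyses. – alfayadd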
