I have solved the problem. For reference:
REGISTER piggybank.jar;
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();
raw = LOAD 'nasa' USING org.apache.hcatalog.pig.HCatLoader();
-- Cast the fields to chararray so that string functions can be used on them
rawCasted = FOREACH raw GENERATE (chararray)host AS host, (chararray)xdate AS xdate, (chararray)address AS address;
-- Cut out the date and put together the columns that are used
rawParsed = FOREACH rawCasted GENERATE host, SUBSTRING(xdate,1,20) AS xdate, address;
-- Make sure that rows with an incomplete date column are omitted
rawFiltered = FILTER rawParsed BY xdate IS NOT NULL;
-- Convert the timestamp string to a datetime
analysisTable = FOREACH rawFiltered GENERATE host, ToDate(xdate, 'dd/MMM/yyyy:HH:mm:ss') AS xdate, address;
aTgrouped = GROUP analysisTable BY host;
resultsB = FOREACH aTgrouped {
    elems = ORDER analysisTable BY xdate DESC;
    two = LIMIT elems 2;      -- keep the two most recent pages
    fstB = ORDER two BY xdate DESC;
    fst = LIMIT fstB 1;       -- the most recent page
    sndB = ORDER two BY xdate ASC;
    snd = LIMIT sndB 1;       -- the page before it
    GENERATE FLATTEN(group), fst.address, snd.address;  -- put the two pages together
};
DUMP resultsB;
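If one row per (host, page) is enough, rather than both pages side by side on one row, the second pair of ORDER/LIMIT steps can be skipped. The following is only a sketch and has not been run against the original table. It assumes the relations aTgrouped and analysisTable defined in the script above; the name lastTwo is made up for illustration.

```pig
-- Sketch: assumes aTgrouped and analysisTable from the script above.
-- lastTwo is a hypothetical relation name.
lastTwo = FOREACH aTgrouped {
    elems = ORDER analysisTable BY xdate DESC;   -- newest request first
    two = LIMIT elems 2;                         -- keep the two most recent requests
    -- FLATTEN turns the bag into rows: up to two (host, xdate, address) rows per host
    GENERATE group AS host, FLATTEN(two.(xdate, address));
};
DUMP lastTwo;
```

A host with only one request yields a single row here, whereas in resultsB fst and snd would both contain that same request.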
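I have made a little progress. I now have rows (already cast, and the date is a real date): (host, date, address). From this, how can I select the last two addresses for each host? Thanks in advance. – alfayadd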