Say you have this data (a sample is shown here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Perform the following steps:
-- 1. Load the data, if you have not already
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group the data by the attribute of your choice; in this case it is typ
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
-- 4. Print the result
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If needed, convert the datetime fields to chararray.
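The datetime-to-chararray conversion could be sketched as follows. This assumes Pig 0.11 or later, where the built-in ToString function accepts a datetime and a format string. The alias formatted and the pattern 'yyyy-MM-dd' are illustrative choices made here, not part of the original script.

```pig
-- Sketch only: same load/group/aggregate steps as above, followed by a
-- formatting step. 'formatted' and the 'yyyy-MM-dd' pattern are assumptions.
A = load 'data.txt' using PigStorage('|') as (typ:chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
grouped = group A by typ;
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;

-- Convert each datetime result to a chararray in yyyy-MM-dd form
formatted = foreach min_max generate
    group as typ,
    ToString(min_eff_dt, 'yyyy-MM-dd') as min_eff_dt,
    ToString(max_canc_dt, 'yyyy-MM-dd') as max_canc_dt;
dump formatted;
```

Because yyyy-MM-dd strings sort in the same order as the dates they represent, another option that may work is to load eff_dt and canc_dt as chararray from the start and apply MIN/MAX to the strings directly.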
Note: there are different ways to do this. What I have shown, apart from the load step, produces the desired result in 2 steps: GROUP and FOREACH.
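Will this work if I have multiple records like this, with different minimum eff_dt and maximum canc_dt values? – pd123
@pd123 Try it and see for yourself –
Thanks. It worked. I also figured out another way. – pd123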