2017-04-04 21 views

Pig: get the record with MIN EFF_DT and MAX CANC_DT from the data

TYP|ID|RECORD|SEX|EFF_DT|CANC_DT 

DMF|1234567|98765432|M|2011-08-30|9999-12-31 
DMF|1234567|98765432|M|2011-04-30|9999-12-31 
DMF|1234567|98765432|M|2011-04-30|9999-12-31 

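Suppose I have multiple records like these. I want to display only the record with the minimum eff_dt and the maximum cancel date.

I want only this one record to be shown: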

DMF|1234567|98765432|M|2011-04-30|9999-12-31 

Thanks

Answers

0

Get the minimum eff_dt and the maximum canc_dt, and use them to filter the relation. Assuming you have relation A:

B = GROUP A ALL; 
X = FOREACH B GENERATE MIN(A.EFF_DT); 
Y = FOREACH B GENERATE MAX(A.CANC_DT); 

C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0)); 
D = DISTINCT C; 
DUMP D; 
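
The snippet above assumes relation A already exists. A minimal sketch of how it might be loaded is below. The path 'data.txt' and the relation name RAW are assumptions for illustration, as is the choice to load the dates as chararray (ISO yyyy-MM-dd strings sort in date order, so MIN and MAX on them should pick the earliest and latest dates). If the file contains the header line, it needs to be filtered out, otherwise the literal string CANC_DT would win the MAX comparison.

```pig
-- Assumption: pipe-delimited input at the hypothetical path 'data.txt'.
RAW = LOAD 'data.txt' USING PigStorage('|')
      AS (TYP:chararray, ID:chararray, RECORD:chararray, SEX:chararray,
          EFF_DT:chararray, CANC_DT:chararray);

-- Drop the header row (TYP|ID|RECORD|...) if it is present in the file.
A = FILTER RAW BY TYP != 'TYP';
```

Note that the filter only returns rows when a single record carries both the overall minimum EFF_DT and the overall maximum CANC_DT.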

Will this work if I have multiple records like these with different minimum eff_dt and maximum canc_dt values? – pd123


@pd123 Try it and see for yourself. –


Thanks. It worked. I also figured out another way. – pd123

0

Let's say you have this data (a sample is shown here):

DMF|1234567|98765432|M|2011-08-30|9999-12-31 
DMF|1234567|98765432|M|2011-04-30|9999-12-31 
DMF|1234567|98765432|M|2011-04-30|9999-12-31 
DMX|1234567|98765432|M|2011-12-30|9999-12-31 
DMX|1234567|98765432|M|2011-04-30|9999-12-31 
DMX|1234567|98765432|M|2011-04-01|9999-12-31 

Follow these steps:

-- 1. Read data, if you have not 
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime); 

-- 2. Group data by the attribute you like to, in this case it is TYP 
grouped = group A by typ; 

-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields 
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt; 

-- 4. Dump the result; the output follows the dump statement
dump min_max; 
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z) 
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z) 

If needed, convert the datetime values to chararray.
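
One way to do that conversion is sketched below using Pig's built-in ToString function with a date format pattern. The relation name min_max_str is an assumption introduced for illustration; min_max and its field names come from the steps above.

```pig
-- Sketch: render the datetime columns of min_max as yyyy-MM-dd strings.
min_max_str = foreach min_max generate
    group as typ,
    ToString(min_eff_dt, 'yyyy-MM-dd') as min_eff_dt,
    ToString(max_canc_dt, 'yyyy-MM-dd') as max_canc_dt;

dump min_max_str;
```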

Note: there are different ways of doing this. What I have shown, apart from the load step, produces the required result in 2 steps: GROUP and FOREACH.
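
The min_max relation holds only the per-group dates, not the full records. If the complete rows are wanted, as in the question, one possible follow-up is to join min_max back to A. This is a sketch, not part of the original answer: the relation names joined, matched, and result are assumptions, and it presumes datetime fields are usable as join keys in your Pig version.

```pig
-- Keep only the rows of A whose dates equal their group's min eff_dt and max canc_dt.
joined = join A by (typ, eff_dt, canc_dt),
              min_max by (group, min_eff_dt, max_canc_dt);

-- Project A's columns back out of the joined relation.
matched = foreach joined generate
    A::typ, A::id, A::record, A::sex, A::eff_dt, A::canc_dt;

-- Remove exact duplicate rows, as in the sample data for DMF.
result = distinct matched;

dump result;
```

A group appears in the result only if at least one of its rows has both the group's minimum eff_dt and its maximum canc_dt.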