2013-07-13

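Pig: select a group of multiple top records

I want to select multiple records that share the same key from a group, but I don't know how to filter for this.

For example, given the following data: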

D1,20130701,M1,V1

D1,20130701,M2,V2

D1,20130702,M1,V3

D1,20130703,M1,V4

D1,20130703,M2,V5

D2,20130701,M1,V1

D2,20130702,M1,V3

D2,20130703,M1,V4

and the following load and group statements:

A = load '/home/hduser/t.csv' 
     using PigStorage(',') 
     as (
      device:chararray, 
      dt:chararray, 
      metric:chararray, 
      value:chararray 
     ); 

C = group A by (device, dt); 

this produces:

((D1,20130701),{(D1,20130701,M1,V1),(D1,20130701,M2,V2)})

((D1,20130702),{(D1,20130702,M1,V3)})

((D1,20130703),{(D1,20130703,M1,V4),(D1,20130703,M2,V5)})

((D2,20130701),{(D2,20130701,M1,V1)})

((D2,20130702),{(D2,20130702,M1,V3)})

((D2,20130703),{(D2,20130703,M1,V4)})

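The question is: how should I filter this so that I only get the rows shown in bold, that is, the (D1,20130701) and (D2,20130701) groups? The logic is: for each device (D1, D2, ...), give me the rows with the lowest date.

If I group by device only: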

B = group A by device; 

I get the following two rows:

(D1,{(D1,20130701,M1,V1),(D1,20130701,M2,V2),(D1,20130702,M1,V3),(D1,20130703,M1,V4),(D1,20130703,M2,V5)})

(D2,{(D2,20130701,M1,V1),(D2,20130702,M1,V3),(D2,20130703,M1,V4)})

But I can't use LIMIT inside the FOREACH, because the number of records per device is variable.

Any ideas? I'm fairly new to Pig!

Many thanks.

Answer

One approach is:

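-- No USING clause, so the default PigStorage loader splits fields on tabs
records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (
     device:chararray, 
     dt:int, 
     metric:chararray, 
     value:chararray); 

-- One group (and one bag of rows) per device
records_group = GROUP records BY device; 

-- Emit every row of the bag, with the device's minimum dt appended as a fifth field
with_min = FOREACH records_group 
     GENERATE 
     FLATTEN(records), MIN(records.dt); 

-- Keep only the rows whose dt ($1) equals the device's minimum dt ($4)
filterRecords = FILTER with_min BY ($1 == $4); 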

The input is:

D1 20130701 M1 V1

D1 20130701 M2 V2

D1 20130702 M1 V3

D1 20130703 M1 V4

D1 20130703 M2 V5

D2 20130702 M1 V3

D2 20130703 M1 V4

The output is:

(D1,20130701,M1,V1,20130701)

(D1,20130701,M2,V2,20130701)

(D2,20130702,M1,V3,20130702)
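
The same idea can be written against the comma-separated file from the question, using field names instead of positions. This is only a sketch: the aliases B through E and the name min_dt are made up for illustration, and dt is loaded as int as in the script above (MIN also accepts chararray, so the question's original schema should work as well).

```pig
-- Load the comma-separated file from the question
A = LOAD '/home/hduser/t.csv' USING PigStorage(',')
    AS (device:chararray, dt:int, metric:chararray, value:chararray);

-- One bag of rows per device
B = GROUP A BY device;

-- Attach the device's minimum date to every row of that device
C = FOREACH B GENERATE FLATTEN(A), MIN(A.dt) AS min_dt;

-- After FLATTEN the original fields are prefixed with the bag name (A::)
D = FILTER C BY A::dt == min_dt;

-- Drop the helper column and restore plain field names
E = FOREACH D GENERATE A::device AS device, A::dt AS dt,
                       A::metric AS metric, A::value AS value;

DUMP E;
```

For the sample data in the question, this should keep the two D1 rows and the one D2 row dated 20130701. Because the filter compares each row against its own device's minimum, it does not matter how many rows a device has, which is what LIMIT could not handle.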

Perfect, Nag, thank you very much! This gives me exactly what I was after. – Vinay