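I tried it; this looks like an HL7 file. You can use Stitch, Over, and lead to come up with something like the script below. From a performance perspective there may be better solutions than this, but I think it should work. Please let me know how it goes.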
DEFINE Over org.apache.pig.piggybank.evaluation.Over('long');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE lead org.apache.pig.piggybank.evaluation.Lead;
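-- 'in' is a reserved keyword in Pig, so the relation is named hl7_rows instead
hl7_rows = LOAD 'hl_file' using PigStorage('|') as (id:chararray, num:int, reason:chararray);
-- RANK prepends a row number ($0), which preserves the original line order
temp = rank hl7_rows;
ranked = foreach temp generate $0 as row_no, $1 as id:chararray, $2 as orig_id:int, $3 as reason:chararray;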
OBR_data = FILTER ranked by id == 'OBR';
next_row_num_OBR = FOREACH (group OBR_data by id) {
    sorted = ORDER OBR_data by row_no;
    stitched = Stitch(sorted, Over(sorted.row_no, 'lead',0,1,1,(long)9999));
    generate flatten(group) as (id:chararray),
             flatten(stitched.(row_no, orig_id, reason, result)) as (row_no:long, orig_id:int, reason:chararray, next_row_no:long);
}
OBX_data = FILTER ranked by id == 'OBX';
Crossed = CROSS next_row_num_OBR, OBX_data;
result = FILTER Crossed BY (OBX_data::row_no > next_row_num_OBR::row_no and OBX_data::row_no < next_row_num_OBR::next_row_no);
This should produce output like this:
(OBR,5,2,RFLX TO VERIFICATION,8,7,OBX,2,SODIUM)
(OBR,1,1,METABOLIC PANEL,5,2,OBX,1,Glucose)
(OBR,5,2,RFLX TO VERIFICATION,8,6,OBX,1,EGFR)
(OBR,8,3,AMBIGUOUS DEFAULT,9999,9,OBX,1,POTASSIUM)
(OBR,1,1,METABOLIC PANEL,5,3,OBX,2,BUN)
(OBR,1,1,METABOLIC PANEL,5,4,OBX,3,CREATININE)
Instead of a file name or a constant, it simply attaches each OBR record to its corresponding OBX records.
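To make the pairing logic easier to follow, here is a minimal sketch of the same idea in plain Python rather than Pig. It is only an illustration, not a replacement for the script: the sample `rows` are reconstructed from the output shown above, and the names `rows`, `SENTINEL`, and `pair_obx_with_obr` are made up for this example.

```python
# Sample rows reconstructed from the output above: (segment id, sequence number, text).
rows = [
    ("OBR", 1, "METABOLIC PANEL"),
    ("OBX", 1, "Glucose"),
    ("OBX", 2, "BUN"),
    ("OBX", 3, "CREATININE"),
    ("OBR", 2, "RFLX TO VERIFICATION"),
    ("OBX", 1, "EGFR"),
    ("OBX", 2, "SODIUM"),
    ("OBR", 3, "AMBIGUOUS DEFAULT"),
    ("OBX", 1, "POTASSIUM"),
]

SENTINEL = 9999  # same default value the Pig script passes to Over(..., 'lead', ...)


def pair_obx_with_obr(rows):
    # Like RANK: number the rows from 1 in file order.
    ranked = [(n, seg, num, text) for n, (seg, num, text) in enumerate(rows, start=1)]
    obr = [r for r in ranked if r[1] == "OBR"]
    obx = [r for r in ranked if r[1] == "OBX"]

    # Like lead: each OBR also gets the row number of the next OBR,
    # or SENTINEL when it is the last one.
    with_next = []
    for i, (row_no, seg, num, text) in enumerate(obr):
        next_row_no = obr[i + 1][0] if i + 1 < len(obr) else SENTINEL
        with_next.append((seg, row_no, num, text, next_row_no))

    # Like CROSS + FILTER: keep only the OBX rows that fall strictly
    # between an OBR's row number and the next OBR's row number.
    return [
        o + (x_row, x_seg, x_num, x_text)
        for o in with_next
        for (x_row, x_seg, x_num, x_text) in obx
        if o[1] < x_row < o[4]
    ]


pairs = pair_obx_with_obr(rows)
for p in pairs:
    print(p)
```

The tuples have the same field layout as the Pig output, but they come out in file order, whereas Pig does not guarantee any particular order for the filtered result.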
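Please help me, I have been struggling with this a lot. – animal
DISTINCT isn't working? Your expected result doesn't include all the distinct records. –
No, DISTINCT did not work. And yes, the result doesn't have all the distinct records; I only want the records I showed. How can I do that? – animal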