2013-06-29

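Pig Latin (filtering a 2nd data source within a foreach loop)

I have 2 data sources. One contains a list of API calls and the other contains all the related authentication events. Each API call can have multiple auth events, and I want to find the auth event that:
a) contains the same "identifier" as the API call
b) occurred within one second after the API call
c) is the closest to the API call, after the above filtering.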

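I had planned to loop through each ApiCall event in a foreach loop and then use filter statements on the authEvents to find the correct one. However, it does not appear that this is possible (see "USING Filter in a Nested FOREACH in PIG").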

Would anyone be able to suggest another way to achieve this? If it helps, here is the Pig script I tried to use:

apiRequests = LOAD '/Documents/ApiRequests.txt' AS (api_fileName:chararray, api_requestTime:long, api_timeFromLog:chararray, api_call:chararray, api_leadString:chararray, api_xmlPayload:chararray, api_sourceIp:chararray, api_username:chararray, api_identifier:chararray); 
authEvents = LOAD '/Documents/AuthEvents.txt' AS (auth_fileName:chararray, auth_requestTime:long, auth_timeFromLog:chararray, auth_call:chararray, auth_leadString:chararray, auth_xmlPayload:chararray, auth_sourceIp:chararray, auth_username:chararray, auth_identifier:chararray); 
specificApiCall = FILTER apiRequests BY api_call == 'CSGetUser';     -- Get all events for this specific call 
match = foreach specificApiCall {            -- Now try to get the closest matching auth event 
     filtered1 = filter authEvents by auth_identifier == api_identifier;  -- Only use auth events that have the same identifier (this will return several) 
     filtered2 = filter filtered1 by (auth_requestTime-api_requestTime)<1000; -- Further refine by using auth events within a second of the api call's time 
     sorted = order filtered2 by auth_requestTime;       -- Get the auth event that's closest to the api call 
     limited = limit sorted 1; 
     generate limited; 
     }; 
dump match; 

Answer


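The nested FOREACH is not meant for working with a second relation while looping through the first. It is for when your relation has a bag in it and you want to work with that bag as if it were its own relation. You cannot work with apiRequests and authEvents at the same time unless you first do some sort of join or grouping to put all the information you need into a single relation.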

Conceptually, your task works nicely with a JOIN and a FILTER, if you did not need to limit yourself to a single auth event:

allPairs = JOIN specificApiCall BY api_identifier, authEvents BY auth_identifier; 
match = FILTER allPairs BY (auth_requestTime-api_requestTime)<1000; 

Now all of your information is together, and you can do GROUP match BY api_identifier followed by a nested FOREACH to pick out a single event.
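
A minimal sketch of that grouping step, assuming the match relation produced by the JOIN and FILTER above. The aliases byCall, sorted, nearest and closestAuth are illustrative names that do not appear in the original script. Grouping on api_identifier alone assumes each identifier belongs to a single API call; if an identifier can repeat, grouping on (api_identifier, api_requestTime) may be closer to what is wanted.

```pig
-- Assumes `match` holds the joined and filtered API/auth pairs from above.
byCall = GROUP match BY api_identifier;

closestAuth = FOREACH byCall {
    -- Every pair in the bag already passed the time filter,
    -- so the earliest auth event is the one nearest the call.
    sorted  = ORDER match BY auth_requestTime;
    nearest = LIMIT sorted 1;
    GENERATE FLATTEN(nearest);
};
```

Because the filter runs before the ORDER and LIMIT here, the event kept for each group is chosen only from those that fall inside the one-second window.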

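However, it can all be done in one step if you use the COGROUP operator, which is like a JOIN but without the cross product: you get two bags containing the grouped records from each relation. Use this to pick out the nearest auth event: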

cogrp = COGROUP specificApiCall BY api_identifier, authEvents BY auth_identifier; 
singleAuth = FOREACH cogrp { 
    auth_sorted = ORDER authEvents BY auth_requestTime; 
    auth_1 = LIMIT auth_sorted 1; 
    GENERATE FLATTEN(specificApiCall), FLATTEN(auth_1); 
    }; 

Then FILTER to leave only those within 1 second:

match = FILTER singleAuth BY (auth_requestTime-api_requestTime)<1000; 
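
One thing to watch: as written, this filter also keeps auth events that happened before the API call, because a negative difference is still less than 1000. If only events after the call should count, as in condition (b) of the question, a lower bound could be added. A sketch, reusing the singleAuth alias from above and assuming the timestamps are in milliseconds; matchAfter is an illustrative name:

```pig
-- Keep only auth events from the call time up to (not including) one second later.
matchAfter = FILTER singleAuth BY
    (auth_requestTime - api_requestTime) >= 0L
    AND (auth_requestTime - api_requestTime) < 1000L;
```

Note also that in the COGROUP version the LIMIT is applied before this filter, so only the earliest auth event per identifier is ever tested against the window. If earlier, unrelated auth events can share the identifier, filtering before sorting (as in the JOIN variant) is the safer order.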

Thanks 小熊, I used COGROUP and it worked a treat. You're the best! – Hinchy