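The general plan: group couples by id and run a COUNT, then do a LEFT join of potentialIDs with the output of that COUNT. From there you can format the result however you need. The comments in the code explain each step in more detail.
NOTE: If you need me to go into more detail just let me know, but I think the comments explain what is happening pretty well.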
-- B holds the number of times each id occurs in couples
B = FOREACH (GROUP couples BY id)
-- Output and schema of the group is:
-- {group: chararray,couples: {(id: chararray,value: chararray)}}
-- (1,{(1,a),(1,x)})
-- (2,{(2,y)})
-- COUNT(couples) counts the number of tuples in the bag
GENERATE group AS id, COUNT(couples) AS count ;
-- Now we want to do a LEFT join on potentialIDs and B since it will
-- create nulls for IDs that appear in potentialIDs, but not in B
C = FOREACH (JOIN potentialIDs BY id LEFT, B BY id)
-- The output and schema for the join is:
-- {potentialIDs::id: chararray,B::id: chararray,B::count: long}
-- (1,1,2)
-- (2,2,1)
-- (3,,)
-- Now we pull out only one ID, and convert any NULLs in count to 0s
GENERATE potentialIDs::id, (B::count is NULL?0:B::count) AS count ;
The schema and output for C are:
C: {potentialIDs::id: chararray,count: long}
(1,2)
(2,1)
(3,0)
If you don't want the disambiguate operator (the ::) in the schema of C, you can change the GENERATE line to:
GENERATE potentialIDs::id AS id, (B::count is NULL?0:B::count) AS count ;
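The script above assumes that couples and potentialIDs are already loaded with the schemas shown in the comments. A minimal sketch of that setup, assuming a comma-delimited file for couples and a one-id-per-line file for potentialIDs (both file names are hypothetical, not from the question):

```pig
-- Hypothetical input files; adjust the paths and delimiter to your data.
-- couples.txt holds lines such as "1,a", "1,x", "2,y"
couples = LOAD 'couples.txt' USING PigStorage(',')
          AS (id:chararray, value:chararray);

-- potentialIDs.txt holds one id per line, such as "1", "2", "3"
potentialIDs = LOAD 'potentialIDs.txt' AS (id:chararray);

-- Check that the schemas match what the GROUP and JOIN expect
DESCRIBE couples;
DESCRIBE potentialIDs;
```

Put these statements before the definition of B, and add DUMP C; (or a STORE) after the definition of C to see the result.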