2013-06-22 65 views
1

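Avoiding prefixes in multi-relation joins in Pig

I want to do a star-schema style join in Pig; my code is below. When I join multiple relations on different columns, I have to prefix each column with the name of every previous join. I believe there should be a better way, but I could not find one by searching Google. Any pointers would be very helpful.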

That is, I want to avoid having to prefix columns like this: "H864::H86::hs_8_d::hs_8_desc".

hs_8 = LOAD 'hs_8_distinct' USING PigStorage('^') as (hs_8:chararray,hs_8_desc:chararray); 
hs_8_d = FOREACH hs_8 GENERATE SUBSTRING(hs_8,0,2) as hs_2,SUBSTRING(hs_8,0,4) as hs_4,SUBSTRING(hs_8,0,6) as hs_6,hs_8,hs_8_desc; 

hs_6_d = LOAD 'hs_6_distinct' USING PigStorage('^') as (hs_6:chararray,hs_6_desc:chararray); 
hs_4_d = LOAD 'hs_4_distinct' USING PigStorage('^') as (hs_4:chararray,hs_4_desc:chararray); 
hs_2_d = LOAD 'hs_2_distinct' USING PigStorage('^') as (hs_2:chararray,hs_2_desc:chararray); 

H86 = JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated' ; 
H864 = JOIN H86 BY hs_8_d::hs_4, hs_4_d BY hs_4 USING 'replicated' ; 
H8642 = JOIN H864 BY H86::hs_8_d::hs_2, hs_2_d BY hs_2 USING 'replicated' ; 

hs_dim = FOREACH H8642 GENERATE hs_2_d::hs_2,hs_2_d::hs_2_desc,H864::hs_4_d::hs_4,H864::hs_4_d::hs_4_desc,H864::H86::hs_6_d::hs_6,H864::H86::hs_6_d::hs_6_desc,H864::H86::hs_8_d::hs_8,H864::H86::hs_8_d::hs_8_desc; 

Answers

2

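By wrapping each join in an extra FOREACH, you can simplify the aliases somewhat. Checking the job stats, this does not add extra MR jobs to the pipeline: both the original script and this version produce 4 map-only jobs.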

E.g.:

H86 = foreach (JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated') generate 
     hs_8_d::hs_2 as x1, 
     hs_8_d::hs_4 as x2, 
     hs_8_d::hs_6 as x3, 
     hs_8_d::hs_8 as x4, 
     hs_8_d::hs_8_desc as x5, 
     hs_6_d::hs_6 as x6, 
     hs_6_d::hs_6_desc as x7; 

H864 = foreach (JOIN H86 BY x2, hs_4_d BY hs_4 USING 'replicated') generate 
      H86::x1 as y1, 
      H86::x2 as y2, 
      H86::x3 as y3, 
      H86::x4 as y4, 
      H86::x5 as y5, 
      H86::x6 as y6, 
      H86::x7 as y7, 
      hs_4_d::hs_4 as y8, 
      hs_4_d::hs_4_desc as y9; 

H8642 = foreach (JOIN H864 BY y1, hs_2_d BY hs_2 USING 'replicated') generate 
      H864::y1 as z1, 
      H864::y2 as z2, 
      H864::y3 as z3, 
      H864::y4 as z4, 
      H864::y5 as z5, 
      H864::y6 as z6, 
      H864::y7 as z7, 
      H864::y8 as z8, 
      H864::y9 as z9, 
      hs_2_d::hs_2 as z10, 
      hs_2_d::hs_2_desc as z11; 

hs_dim = FOREACH H8642 GENERATE z10, z11, z8, z9, z6, z7, z4, z5; 

If you have a bag of tuples, then DataFu's AliasBagFields may help.

0

Pig always prepends bagname:: to field names so that fields can be told apart after a join. Unfortunately, I don't think you can avoid this.
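That said, the prefixes do not have to be worked out by hand. The following is a minimal sketch, not a tested solution: it continues the script from the question (it assumes the relations H864 and H8642 already exist), uses DESCRIBE to print the exact prefixed names, and relies on the documented behaviour that the :: operator is only needed when a short field name is ambiguous. Which names count as unambiguous here is an assumption based on the schemas in the question, so check the DESCRIBE output on your Pig version first.

```pig
-- Print the schema of the final join to see the exact prefixed field names.
DESCRIBE H8642;

-- Sketch: fields whose short name is unique in the joined schema
-- (hs_2_desc, hs_4_desc, hs_6_desc, hs_8, hs_8_desc) are referenced without
-- a prefix. hs_2, hs_4 and hs_6 each occur twice, so they stay fully qualified.
hs_dim = FOREACH H8642 GENERATE
    hs_2_d::hs_2            AS hs_2, hs_2_desc,
    H864::hs_4_d::hs_4      AS hs_4, hs_4_desc,
    H864::H86::hs_6_d::hs_6 AS hs_6, hs_6_desc,
    hs_8, hs_8_desc;
```

Because each join is on equality, the two copies of hs_2, hs_4 and hs_6 hold the same value in every row, so either copy can be kept.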

+0

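When I have more than 3 relations it gets complicated, and I find it hard to work out the lengthy prefixes. How do you handle this situation? Or is there an easy way to derive the prefix? If I join 20+ relations, I think this could get super complicated. I'm new to Pig and would love to know how to handle this. – vumaasha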