2014-06-24

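Optimizing a Pig script

I am trying to generate aggregated output. The problem is that all of the data ends up in a single reducer (the FILTER and COUNT steps cause the problem). How can I optimize the script below?

Expected output: group,10,2,12,34 ...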

data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray); 

grp1 = GROUP data BY UA PARALLEL 50; 
fr1 = FOREACH grp1 { 
     fltrCol1 = FILTER data BY Col1 == 'Other'; 
     fltrCol2 = FILTER data BY Col2 == 'Other'; 
     fltrCol3 = FILTER data BY Col3 == 'Other'; 
     fltrCol4 = FILTER data BY col4 == 'Other'; 
     fltrCol5 = FILTER data BY col5 == 'Other'; 
     cnt_fltrCol1 = COUNT(fltrCol1); 
     cnt_fltrCol2 = COUNT(fltrCol2); 
     cnt_fltrCol3 = COUNT(fltrCol3); 
     cnt_fltrCol4 = COUNT(fltrCol4); 
     cnt_fltrCol5 = COUNT(fltrCol5); 
     GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5; 
} 

Answer

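You can move the filter logic ahead of the GROUP by adding fltrCol{1,2,3,4,5} columns as integers and then summing them. Off the top of my head, the script is: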

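-- NOTE: UA is not declared in the schema as posted; it is assumed to be a field of data.
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);

-- Turn each "== 'Other'" test into a 0/1 flag before grouping.
-- The alias is "flagged" because "filter" is a reserved word in Pig Latin.
-- Field names are case-sensitive, so col4 and col5 must match the schema.
flagged = FOREACH data GENERATE UA,
     ((Col1 == 'Other') ? 1 : 0) AS fltrCol1,
     ((Col2 == 'Other') ? 1 : 0) AS fltrCol2,
     ((Col3 == 'Other') ? 1 : 0) AS fltrCol3,
     ((col4 == 'Other') ? 1 : 0) AS fltrCol4,
     ((col5 == 'Other') ? 1 : 0) AS fltrCol5;

-- Group the flagged relation, not the raw data.
grp1 = GROUP flagged BY UA PARALLEL 50;

-- Sum the flags per group; each flag column is read from the grouped bag.
fr1 = FOREACH grp1 GENERATE group,
     SUM(flagged.fltrCol1) AS cnt_fltrCol1,
     SUM(flagged.fltrCol2) AS cnt_fltrCol2,
     SUM(flagged.fltrCol3) AS cnt_fltrCol3,
     SUM(flagged.fltrCol4) AS cnt_fltrCol4,
     SUM(flagged.fltrCol5) AS cnt_fltrCol5;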
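
To see why flag-and-sum should give the same numbers as filter-and-count, here is a small plain-Python model of the two approaches. It is only a sketch, not Pig: the sample rows, the UA values, and the helper names (filter_then_count, flag_then_sum, merge) are all made up for illustration. The merge step models how partial sums from different input splits can be added together, which is, as far as I understand it, what lets Pig apply a combiner to an algebraic function such as SUM.

```python
from collections import defaultdict

# Column names follow the schema in the question; UA is assumed to exist.
COLS = ("Col1", "Col2", "Col3", "col4", "col5")


def filter_then_count(rows):
    """Question's approach: keep every row per UA, then filter and count."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["UA"]].append(row)  # the whole bag is held per key
    return {
        ua: tuple(sum(1 for r in bag if r[c] == "Other") for c in COLS)
        for ua, bag in groups.items()
    }


def flag_then_sum(rows):
    """Answer's approach: 0/1 flags per row, running sums per UA."""
    totals = defaultdict(lambda: [0] * len(COLS))
    for row in rows:
        acc = totals[row["UA"]]
        for i, c in enumerate(COLS):
            acc[i] += 1 if row[c] == "Other" else 0
    return {ua: tuple(acc) for ua, acc in totals.items()}


def merge(a, b):
    """Add partial sums from two splits, as a combiner or reducer would."""
    out = dict(a)
    for ua, counts in b.items():
        prev = out.get(ua, (0,) * len(COLS))
        out[ua] = tuple(x + y for x, y in zip(prev, counts))
    return out


# Hypothetical sample data, invented for this sketch.
rows = [
    {"UA": "Firefox", "Col1": "Other", "Col2": "x", "Col3": "Other", "col4": "y", "col5": "z"},
    {"UA": "Firefox", "Col1": "Other", "Col2": "Other", "Col3": "x", "col4": "y", "col5": "z"},
    {"UA": "Chrome", "Col1": "x", "Col2": "Other", "Col3": "x", "col4": "Other", "col5": "Other"},
]

print(flag_then_sum(rows))
```

Both functions return the same per-UA counts, and merging the sums computed on rows[:1] and rows[1:] gives the same result as processing all rows at once. The filter-and-count version has no such shortcut here, because it needs the complete bag for a key before it can filter it.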

Thanks, Alex. There were some data issues; it works fine now. I will implement your idea to optimize it. – Arun