2014-04-29

Two-stage Pig processing: how do I merge two Pig statements into one?

my_out = foreach (group my_in by id) { 
    grouped = BagGroup(my_in.(keyword,weight),my_in.keyword); 
    generate 
    group as id, 
    CountEach(my_in.domain) as domains, 
    grouped as grouped; 
}; 
my_out1 = foreach my_out { 
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight; 
    generate id, domains, keywords; 
}; 

However, when I merge them:

my_out = foreach (foreach (group my_in by id) { 
    grouped = BagGroup(my_in.(keyword,weight),my_in.keyword); 
    generate 
    group as id, 
    CountEach(my_in.domain) as domains, 
    grouped as grouped; 
    }) { 
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight; 
    generate id, domains, keywords; 
    }; 

I get an error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5. 

My questions are:

  1. How do I avoid this error?
  2. Does what I am trying to do even make sense? Even if I manage to do it, will it save me an MR pass?

Answer


In general, Pig's ability to parse complex nested expressions is unreliable. Another common error when too much processing is nested is ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""

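I often try to do this myself, to avoid having to come up with a bunch of alias names that mean nothing except as intermediate steps in a computation. But sometimes it is not possible, as you have discovered. My guess is that a nested foreach inside another nested foreach does not parse. In your case, though, it looks like the first nested foreach is unnecessary. Try this: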

my_out = foreach (foreach (group my_in by id) 
    generate 
    group as id, 
    CountEach(my_in.domain) as domains, 
    BagGroup(my_in.(keyword,weight),my_in.keyword) as grouped 
) { 
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight; 
    generate id, domains, keywords; 
    }; 

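As for your second question: no, this makes no difference to the final MR plan. It is purely a matter of how Pig parses the script; grouping the commands this way does not change the map-reduce logic.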
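One way to check this claim yourself is Pig's explain command, which prints the logical, physical, and MapReduce plans for an alias. The sketch below assumes that my_in is already loaded, that BagGroup and CountEach are already defined, and that my_out and my_out1 are the two separate statements from the question.

```pig
-- Assumption: my_in, BagGroup and CountEach are set up as in the question,
-- and my_out / my_out1 are defined by the two-statement version.
-- explain prints the plans Pig would run, without executing the job.
explain my_out1;
```

If the statement above holds, the MapReduce plan section of the output should show the group and both foreach steps inside a single job, so merging them into one statement would not remove a pass.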


I get 'ERROR 1000: Error during parsing. Lexical error at line 25, column 0. Encountered: <EOF> after : ""' with your code – sds


Darn. Then you may be out of luck. But rest assured, splitting the statements apart will not add any map-reduce jobs.