2012-11-06

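Pig MapReduce to count consecutive letters

Instead of counting words, I need to count letters, but I am having trouble implementing this with Apache Pig version 0.8.1-cdh3u1.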

Given the following input:

989;850;abcccc 
29;395;aabbcc 

the output should be:

989;850;a;1 
989;850;b;1 
989;850;c;4 
29;395;a;2 
29;395;b;2 
29;395;c;2 

Here is my attempt:

A = LOAD 'input' using PigStorage(';') as (x:int, y:int, content:chararray); 
B = foreach A generate x, y, FLATTEN(STRSPLIT(content, '(?<=.)(?=.)', 6)) as letters; 
C = foreach B generate x, y, FLATTEN(TOBAG(*)) as letters; 
D = foreach C generate x, y, letters.letters as letter; 
E = GROUP D BY (x,y,letter); 
F = foreach E generate group.x as x, group.y as y, group.letter as letter, COUNT(D.letter) as count; 

A, B and C can be dumped, but "dump D" results in "ERROR 2997: Unable to recreate exception from backed error: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple"

dump C displays (despite the third value being a weird tuple):

(989,850,a) 
(989,850,b) 
(989,850,c) 
(989,850,c) 
(989,850,c) 
(989,850,c) 
(29,395,a) 
(29,395,a) 
(29,395,b) 
(29,395,b) 
(29,395,c) 
(29,395,c) 

Here are the schemas:

grunt> describe A; describe B; describe C; describe D; describe E; describe F; 
A: {x: int,y: int,content: chararray} 
B: {x: int,y: int,letters: bytearray} 
C: {x: int,y: int,letters: (x: int,y: int,letters: bytearray)} 
D: {x: int,y: int,letter: bytearray} 
E: {group: (x: int,y: int,letter: bytearray),D: {x: int,y: int,letter: bytearray}} 
F: {x: int,y: int,letter: bytearray,count: long} 

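This version of Pig doesn't seem to support TOBAG($2..$8), hence TOBAG(*), which also includes x and y, but that could be sorted out syntactically later. I would like to avoid writing a UDF; otherwise I would use the Java API directly.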

But I don't really understand the cast error. Can someone please explain it?

Answers


I would suggest writing a custom UDF instead. A quick, raw implementation would look like this:

package com.example; 

import java.io.IOException; 
import java.util.HashMap; 
import java.util.Map; 

import org.apache.pig.EvalFunc; 
import org.apache.pig.data.BagFactory; 
import org.apache.pig.data.DataBag; 
import org.apache.pig.data.Tuple; 
import org.apache.pig.data.TupleFactory; 

public class CharacterCount extends EvalFunc<DataBag> { 

    private static final BagFactory bagFactory = BagFactory.getInstance(); 
    private static final TupleFactory tupleFactory = TupleFactory.getInstance(); 

    @Override 
    public DataBag exec(Tuple input) throws IOException { 
        try { 
            // Occurrence count per character for this input record.
            Map<Character, Integer> charMap = new HashMap<Character, Integer>(); 

            DataBag result = bagFactory.newDefaultBag(); 
            // Expected input fields: (x:int, y:int, content:chararray).
            int x = (Integer) input.get(0); 
            int y = (Integer) input.get(1); 
            String content = (String) input.get(2); 

            // Tally every character of the content string.
            for (int i = 0; i < content.length(); i++) { 
                char c = content.charAt(i); 
                Integer count = charMap.get(c); 
                count = (count == null) ? 1 : count + 1; 
                charMap.put(c, count); 
            } 

            // Emit one (x, y, letter, count) tuple per distinct character.
            for (Map.Entry<Character, Integer> entry : charMap.entrySet()) { 
                Tuple res = tupleFactory.newTuple(4); 
                res.set(0, x); 
                res.set(1, y); 
                res.set(2, String.valueOf(entry.getKey())); 
                res.set(3, entry.getValue()); 
                result.add(res); 
            } 

            return result; 

        } catch (Exception e) { 
            throw new RuntimeException("CharacterCount error", e); 
        } 
    } 

} 

Package it in a jar, then execute it:

register '/home/user/test/myjar.jar'; 
A = LOAD '/user/hadoop/store/sample/charcount.txt' using PigStorage(';') 
     as (x:int, y:int, content:chararray); 

B = foreach A generate flatten(com.example.CharacterCount(x,y,content)) 
     as (x:int, y:int, letter:chararray, count:int); 

dump B; 
(989,850,b,1) 
(989,850,c,4) 
(989,850,a,1) 
(29,395,b,2) 
(29,395,c,2) 
(29,395,a,2) 
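
Note that the dumped rows come out as b, c, a rather than a, b, c, because the UDF collects its counts in a HashMap, which makes no ordering promise. If first-seen order matters, one option is to count into a LinkedHashMap instead. The following is only a minimal sketch of that counting step in plain Java, without any Pig classes; the class name LetterCounts and the method count are hypothetical names used for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts letters while remembering first-seen order.
class LetterCounts {

    // Returns letter -> number of occurrences, iterated in first-seen order.
    static Map<Character, Integer> count(String content) {
        Map<Character, Integer> counts = new LinkedHashMap<Character, Integer>();
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            Integer n = counts.get(c);
            counts.put(c, n == null ? 1 : n + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("abcccc")); // prints {a=1, b=1, c=4}
        System.out.println(count("aabbcc")); // prints {a=2, b=2, c=2}
    }
}
```

Swapping the map type in the UDF would only change the order of tuples inside the returned bag; Pig itself still gives no ordering guarantee for a relation unless an ORDER BY is applied.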

Thanks for the example code. So it's not possible in "pure" Pig Latin 0.8? – rretzbach


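I don't think it's impossible, but reaching the result with a lot of built-in transformations could be more expensive (i.e. you would end up with more MR jobs). For example, grouping will always force a reduce phase.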


I don't have version 0.8 available, but you can try this:

A = LOAD 'input' using PigStorage(';') as (x:int, y:int, content:chararray); 
B = foreach A generate x, y, FLATTEN(STRSPLIT(content, '(?<=.)(?=.)', 6)); 
C = foreach B generate $0 as x, $1 as y, FLATTEN(TOBAG(*)) as letter; 
E = GROUP C BY (x,y,letter); 
F = foreach E generate group.x as x, group.y as y, group.letter as letter, COUNT(C.letter) as count; 
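
The split pattern '(?<=.)(?=.)' with a limit of 6 can be checked outside Pig. The sketch below assumes that STRSPLIT follows the semantics of java.lang.String.split(regex, limit); the class name SplitDemo and the method splitChars are hypothetical. It shows that the pattern cuts between every pair of adjacent characters, and that the limit of 6 only fits content of at most six characters: anything longer ends up lumped together in the last piece.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of the split regex used in the Pig script.
class SplitDemo {

    // Splits between adjacent characters, producing at most `limit` pieces.
    static List<String> splitChars(String content, int limit) {
        return Arrays.asList(content.split("(?<=.)(?=.)", limit));
    }

    public static void main(String[] args) {
        System.out.println(splitChars("abcccc", 6));   // prints [a, b, c, c, c, c]
        System.out.println(splitChars("abcccccc", 6)); // prints [a, b, c, c, c, ccc]
    }
}
```

So for inputs of varying length, the limit would have to be at least the length of the longest content value, which is one reason the pure Pig Latin approach is awkward here.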

You can try this:

grunt> a = load 'inputfile.txt' using PigStorage(';') as (c1:chararray, c2:chararray, c3:chararray); 
grunt> b = foreach a generate c1,c2,FLATTEN(TOKENIZE(REPLACE(c3,'','^'),'^')) as split_char; 
grunt> c = group b by (c1,c2,split_char); 
grunt> d = foreach c generate group, COUNT(b); 
grunt> dump d; 

The output looks like this:

((29,395,a),2) 
((29,395,b),2) 
((29,395,c),2) 
((989,850,a),1) 
((989,850,b),1) 
((989,850,c),4) 
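
To see why this works: replacing the empty pattern inserts the marker '^' at every position in the string, and tokenizing on that marker then yields one token per character. The sketch below reproduces the idea in plain Java, assuming that Pig's REPLACE behaves like java.lang.String.replaceAll and that TOKENIZE with a delimiter argument splits on that delimiter; the class name ReplaceTokenizeDemo and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo of the REPLACE + TOKENIZE trick.
class ReplaceTokenizeDemo {

    // The empty regex matches at every position, so '^' is inserted
    // before each character and once more at the end of the string.
    static String mark(String s) {
        return s.replaceAll("", "^");
    }

    // Splitting the marked string on '^' yields one token per character.
    static List<String> letters(String s) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(mark(s), "^");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mark("abcccc"));    // prints ^a^b^c^c^c^c^
        System.out.println(letters("abcccc")); // prints [a, b, c, c, c, c]
    }
}
```

One caveat of this approach is that content which itself contains the marker character '^' would lose those characters, so the marker has to be something that never occurs in the data.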

Welcome to StackOverflow. Could you explain your answer so that it is more useful to others? See http://stackoverflow.com/help/how-to-answer – wmk