2013-12-19 95 views
3

Pig: passing a DataBag to a UDF constructor

I have a script that loads some data about event venues:

venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray); 

I then want to create a UDF whose constructor accepts the venues.

So I tried to define the UDF like this:

DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues); 

Here is the actual UDF:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class GenerateVenues extends EvalFunc<Tuple> {

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;

    private String regex;

    // Builds the venue list and a combined regex from the first field of each tuple.
    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        // The long-to-int cast can overflow for very large bags.
        venues = new ArrayList<String>((int) (venuesBag.size() + 1));
        String current = "";
        regex = "";
        while (it.hasNext()) {
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // Expects two fields; only the first (the tweet text) is used.
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                    "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            // Return the first venue whose name occurs in the tweet.
            for (String venue : venues) {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS)) {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                    "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}

When I run the script, it fails at the DEFINE statement, just before (venues);, with this error:

2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN 

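Obviously I am doing something wrong. Can you help me figure out what? Is it that a UDF cannot accept the venues relation as a parameter? Or is a relation not represented as a DataBag, as in public GenerateVenues(DataBag venuesBag)? Thanks!

PS: I am using Pig version 0.11.1.1.3.0.0-107.

Answers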

0

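You cannot use a relation as a parameter in a UDF constructor. Only strings can be passed as arguments; if they really represent another type, you must parse them in the constructor.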
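
For a short, fixed list, "parse them in the constructor" could look roughly like the sketch below. It is plain Java with no Pig dependency so it can be run on its own; the class name VenueList, the comma-separated format, and the sample venue names in the usage note are assumptions made for illustration, not anything from the question. It also uses a literal substring test rather than the regex match in the question's exec.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Pig): the parsing a UDF constructor
// could perform on the single String argument it receives from DEFINE.
final class VenueList {
    private final List<String> venues = new ArrayList<>();

    // Assumption: the venue names arrive as one comma-separated string.
    VenueList(String commaSeparated) {
        for (String part : commaSeparated.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                venues.add(name); // skip empty entries such as "a,,b"
            }
        }
    }

    // Returns the first venue that occurs in the text, or null if none does.
    // Uses a literal substring test, not a regular expression.
    String firstMatch(String text) {
        if (text == null) {
            return null;
        }
        for (String venue : venues) {
            if (text.contains(venue)) {
                return venue;
            }
        }
        return null;
    }

    int size() {
        return venues.size();
    }
}
```

In an actual UDF, a constructor of the form GenerateVenues(String commaSeparated) would build such a list, and the DEFINE statement would pass a quoted string such as 'Hydro,Barrowlands' (made-up names) instead of the relation.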

4

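As @WinnieNicklaus already said, you can only pass strings to UDF constructors.

Having said that, the solution to your problem is to use the distributed cache. You need to override public List<String> getCacheFiles() to return a list of filenames that will be made available through the distributed cache. With that in place, you can read the file as a local file and build your table.

The downside is that Pig has no initialization function, so you have to implement something like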

private boolean initialized = false;

private void init() {
    if (!this.initialized) {
        // read table
        this.initialized = true; // later calls skip the load
    }
}

and then call it as the first thing in exec.
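
The lazy-initialization pattern described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: LazyVenueTable is a hypothetical class, the file is assumed to hold one venue name per line (not the multi-column CSV from the question), and the Pig side (the getCacheFiles override that ships the file to each task) is left out so the example runs without Pig on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the UDF's state: the constructor only stores
// a file name (a String), and the table is loaded on first use.
final class LazyVenueTable {
    private final String localFileName;
    private final List<String> venues = new ArrayList<>();
    private boolean initialized = false;

    LazyVenueTable(String localFileName) {
        this.localFileName = localFileName;
    }

    // Loads the table once; every later call returns immediately.
    private void init() {
        if (initialized) {
            return;
        }
        // Assumption: one venue name per line, UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(localFileName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    venues.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        initialized = true;
    }

    // Plays the role of exec: init() is the first thing it calls.
    String firstMatch(String text) {
        init();
        if (text == null) {
            return null;
        }
        for (String venue : venues) {
            if (text.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

Deferring the read to the first call matters because the constructor may run where the cached file is not yet available, which is presumably why the answer loads the table from exec rather than from the constructor.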