2012-06-06 73 views

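Apache Pig error 2218 after changing Pig version to 0.10.0

I'm at my wits' end trying to solve this. I have a script and a UDF that run perfectly with Pig 0.8.1, but when I try to run them with Pig 0.10.0, I get: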

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2218: Invalid resource schema: bag schema must have tuple as its field 

The code that calls the UDF from the Pig script looks like this:

parsed = LOAD '$INPUT' 
    USING pignlproc.storage.ParsingWikipediaLoader('$LANG') 
    AS (title, id, pageUrl, text, redirect, links, headers, paragraphs); 

The ParsingWikipediaLoader class implements LoadMetadata, and its getSchema() method looks like this:

public ResourceSchema getSchema(String location, Job job) 
     throws IOException { 
    Schema schema = new Schema(); 
    schema.add(new FieldSchema("title", DataType.CHARARRAY)); 
    schema.add(new FieldSchema("id", DataType.CHARARRAY)); 
    schema.add(new FieldSchema("uri", DataType.CHARARRAY)); 
    schema.add(new FieldSchema("text", DataType.CHARARRAY)); 
    schema.add(new FieldSchema("redirect", DataType.CHARARRAY)); 
    Schema linkInfoSchema = new Schema(); 
    linkInfoSchema.add(new FieldSchema("target", DataType.CHARARRAY)); 
    linkInfoSchema.add(new FieldSchema("begin", DataType.INTEGER)); 
    linkInfoSchema.add(new FieldSchema("end", DataType.INTEGER)); 
    schema.add(new FieldSchema("links", linkInfoSchema, DataType.BAG)); 
    Schema headerInfoSchema = new Schema(); 
    headerInfoSchema.add(new FieldSchema("tagname", DataType.CHARARRAY)); 
    headerInfoSchema.add(new FieldSchema("begin", DataType.INTEGER)); 
    headerInfoSchema.add(new FieldSchema("end", DataType.INTEGER)); 
    schema.add(new FieldSchema("headers", headerInfoSchema, DataType.BAG)); 
    Schema paragraphInfoSchema = new Schema(); 
    paragraphInfoSchema.add(new FieldSchema("tagname", DataType.CHARARRAY)); 
    paragraphInfoSchema.add(new FieldSchema("begin", DataType.INTEGER)); 
    paragraphInfoSchema.add(new FieldSchema("end", DataType.INTEGER)); 
    schema.add(new FieldSchema("paragraphs", paragraphInfoSchema, 
      DataType.BAG)); 

    return new ResourceSchema(schema); 
} 

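Again, the script and UDF work as expected with Pig 0.8.1, so something must have changed between the versions. I have searched thoroughly but could not find anything about this in the documentation or on Stack Overflow.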

Answer


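It looks like the difference is in the ResourceFieldSchema constructor.

0.8.1 detects a bag and wraps its inner schema in a tuple, whereas this logic was removed in 0.10.0. I think you need to change your schema definition so that the bag schema is wrapped in a tuple: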

schema.add(new FieldSchema("links", new Schema(
    new FieldSchema("t", linkInfoSchema)), DataType.BAG)); 

Used with 0.8.1, however, this produces a schema with a tuple nested inside a tuple:

  • 0.10.0:{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}
  • 0.8.1:{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (t: (target: chararray,begin: int,end: int))},headers: {t: (t: (tagname: chararray,begin: int,end: int))},paragraphs: {t: (t: (tagname: chararray,begin: int,end: int))}}

You can fix this by setting the two-level-access-required flag on the wrapper schema to true:

Schema linkInfoSchema = new Schema(); 
linkInfoSchema.add(new FieldSchema("target", DataType.CHARARRAY)); 
linkInfoSchema.add(new FieldSchema("begin", DataType.INTEGER)); 
linkInfoSchema.add(new FieldSchema("end", DataType.INTEGER)); 
Schema linkInfoSchemaTupleWrapper = new Schema(new FieldSchema("t", 
    linkInfoSchema)); 
linkInfoSchemaTupleWrapper.setTwoLevelAccessRequired(true); 
schema.add(new FieldSchema("links", linkInfoSchemaTupleWrapper, DataType.BAG)); 

This then produces the same schema under both 0.10.0 and 0.8.1:

{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}

{title: chararray,id: chararray,uri: chararray,text: chararray,redirect: chararray,links: {t: (target: chararray,begin: int,end: int)},headers: {t: (tagname: chararray,begin: int,end: int)},paragraphs: {t: (tagname: chararray,begin: int,end: int)}}

The ResourceFieldSchema constructor in 0.10.0:

/** 
 * Construct using a {@link org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema} as the template. 
 * @param fieldSchema fieldSchema to copy from 
 */ 
public ResourceFieldSchema(FieldSchema fieldSchema) { 
    type = fieldSchema.type; 
    name = fieldSchema.alias; 
    description = "autogenerated from Pig Field Schema"; 
    Schema inner = fieldSchema.schema; 

    // allow partial schema 
    if ((type == DataType.BAG || type == DataType.TUPLE || type == DataType.MAP) 
            && inner != null) { 
        schema = new ResourceSchema(inner); 
    } else { 
        schema = null; 
    } 
} 

The ResourceFieldSchema constructor in 0.8.1:

/** 
 * Construct using a {@link org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema} as the template. 
 * @param fieldSchema fieldSchema to copy from 
 */ 
public ResourceFieldSchema(FieldSchema fieldSchema) { 
    type = fieldSchema.type; 
    name = fieldSchema.alias; 
    description = "autogenerated from Pig Field Schema"; 
    Schema inner = fieldSchema.schema; 
    if (type == DataType.BAG && fieldSchema.schema != null 
            && !fieldSchema.schema.isTwoLevelAccessRequired()) { 
        log.info("Insert two-level access to Resource Schema"); 
        FieldSchema fs = new FieldSchema("t", fieldSchema.schema); 
        inner = new Schema(fs); 
    } 

    // allow partial schema 
    if ((type == DataType.BAG || type == DataType.TUPLE) 
            && inner != null) { 
        schema = new ResourceSchema(inner); 
    } else { 
        schema = null; 
    } 
} 
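
To make the interaction between the two constructors and the flag easier to see, here is a rough, self-contained toy model of the behaviour. It does not use Pig's classes: ToySchema, ToyField, convert081 and convert0100 are hypothetical stand-ins written only for illustration, and the model only mimics the wrapping step, not the validation that raises ERROR 2218.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BagWrapModel {
    enum Type { CHARARRAY, INT, TUPLE, BAG }

    /** Toy stand-in for a Pig Schema: a field list plus the two-level-access flag. */
    static final class ToySchema {
        final List<ToyField> fields = new ArrayList<>();
        boolean twoLevelAccessRequired = false;

        ToySchema add(ToyField f) { fields.add(f); return this; }

        String render() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < fields.size(); i++) {
                if (i > 0) sb.append(",");
                sb.append(fields.get(i).render());
            }
            return sb.toString();
        }
    }

    /** Toy stand-in for a Pig FieldSchema; inner is null for scalar fields. */
    static final class ToyField {
        final String alias;
        final Type type;
        final ToySchema inner;

        ToyField(String alias, Type type, ToySchema inner) {
            this.alias = alias;
            this.type = type;
            this.inner = inner;
        }

        String render() {
            switch (type) {
                case BAG:   return alias + ": {" + inner.render() + "}";
                case TUPLE: return alias + ": (" + inner.render() + ")";
                default:    return alias + ": " + type.name().toLowerCase(Locale.ROOT);
            }
        }
    }

    /** Mimics the 0.8.1 constructor: wrap a bag's inner schema in tuple "t" unless the flag is set. */
    static ToyField convert081(ToyField f) {
        ToySchema inner = f.inner;
        if (f.type == Type.BAG && inner != null && !inner.twoLevelAccessRequired) {
            inner = new ToySchema().add(new ToyField("t", Type.TUPLE, f.inner));
        }
        return new ToyField(f.alias, f.type, inner);
    }

    /** Mimics the 0.10.0 constructor: the inner schema is copied unchanged. */
    static ToyField convert0100(ToyField f) {
        return new ToyField(f.alias, f.type, f.inner);
    }

    /** Builds the "links" bag with an explicit tuple wrapper, optionally setting the flag. */
    static ToyField links(boolean setFlag) {
        ToySchema linkInfo = new ToySchema()
                .add(new ToyField("target", Type.CHARARRAY, null))
                .add(new ToyField("begin", Type.INT, null))
                .add(new ToyField("end", Type.INT, null));
        ToySchema wrapper = new ToySchema().add(new ToyField("t", Type.TUPLE, linkInfo));
        wrapper.twoLevelAccessRequired = setFlag;
        return new ToyField("links", Type.BAG, wrapper);
    }

    public static void main(String[] args) {
        System.out.println("0.8.1,  flag unset: " + convert081(links(false)).render());
        System.out.println("0.8.1,  flag set:   " + convert081(links(true)).render());
        System.out.println("0.10.0, flag set:   " + convert0100(links(true)).render());
    }
}
```

In this model, the explicit wrapper with the flag unset gets wrapped a second time by the 0.8.1 path, while setting the flag makes both paths leave the wrapper alone, which is why the two versions end up with the same schema.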

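Thank you very much! I wrapped each one in a Tuple and now it works like a charm. This solution had actually occurred to me, but I had gotten the wrapping slightly wrong, so all the aliases in the script were messed up. Well done! – chokamp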
