2014-03-05 37 views
1

Creating an Avro schema with a nested record for a simple JSON

I am trying to create an Avro schema (for Hadoop) for the following JSON:

{ 
    "name_tag":"Guy", 
    "known_nested_structure" : { 
    "fieldA" : ["value1"], 
    "fieldB" : ["value1","value2"], 
    "fieldC" : [], 
    "fieldD" : ["value1"] 
    }, 
    "another_field" : "hi" 
} 

My initial idea was this Avro schema (shown together with the Hive command):

CREATE EXTERNAL TABLE IF NOT EXISTS record_table 
    PARTITIONED BY (YEAR INT, MONTH INT, DAY INT, HOUR INT) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
    STORED AS 
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
    LOCATION 'hdfs://localhost/data/output/records_data/hourly' 
    TBLPROPERTIES ('avro.schema.literal'='{ 
    "name": "myRecord", 
    "type": "record", 
    "fields": [ 
    {"name":"name_tag", "type":"string",c"default": ""}, 
    { 
     "name": "known_nested_structure", 
     "type": "record", 
     "fields": [ 
      {"name":"fieldA", "type":{"type":"array","items":"string"},"default":null}, 
      {"name":"fieldB", "type":{"type":"array","items":"string"},"default":null}, 
      {"name":"fieldC", "type":{"type":"array","items":"string"},"default":null}, 
      {"name":"fieldD", "type":{"type":"array","items":"string"},"default":null} 
     ], 
     "default":null 
    }, 
    {"name": "another_field","type":"string","default": ""} 
    ] 
}'); 

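Hive result of this command:

OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
year int
month int
day int
hour int
Time taken: 0.128 seconds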

But for some reason, this Avro schema works:

{ 
    "name": "myRecord", 
    "type": "record", 
    "fields": [ 
    {"name":"name_tag", "type":"string","default": null}, 
    { 
    "name": "known_nested_structure", 
    "type": { 
     "name": "known_nested_structure", 
     "type": "record", 
     "fields": [ 
       {"name":"fieldA", "type":{"type":"array","items":"string"},"default":null}, 
       {"name":"fieldB", "type":{"type":"array","items":"string"},"default":null}, 
       {"name":"fieldC", "type":{"type":"array","items":"string"},"default":null}, 
       {"name":"fieldD", "type":{"type":"array","items":"string"},"default":null} 
       ], 
       "default":null 

     } 
    }, 
     {"name": "another_field","type": "string","default": null} 
    ] 
} 

Result:

OK 
name_tag string from deserializer 
known_nested_structure   struct<fielda:array<string>,fieldb:array<string>,fieldc:array<string>,fieldd:array<string>>   from deserializer 
another_field string from deserializer 
year int 
month int 
day int 
hour int 
Time taken: 0.123 seconds 

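Why does the first Avro schema not work? Why can't I put a record directly as a field, instead of nesting known_nested_structure inside known_nested_structure as in my second schema example?

Thanks,

Guy

Answers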

4

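As far as I can see, AvroSerDe uses the Avro API, and to parse the schema it uses the parse() method of org.apache.avro.Schema. If you look at that method, you can see that it calls itself recursively while reading the fields. So if a field holds a "record", the value of that field's type has to be a complete schema following the same convention (name, type = "record", fields []). That is probably why your second Avro schema works and the first one fails. Look up org.apache.avro.Schema on grepcode; the source should explain it.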
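To make the difference between the two schemas concrete, here is a minimal sketch in Python. It is not Avro's or Hive's parser: `find_inline_complex_fields` is a hypothetical helper that only mimics the rule described above, namely that a field's "type" must itself be a schema. When a field says "type": "record" as a bare string, Avro reads "record" as the name of a previously defined type rather than as the start of a record definition, so the sibling "fields" list is never treated as the record body. The schemas are shortened to one nested field for brevity.

```python
import json

# Hypothetical helper, not part of Avro or Hive. It flags fields written as
#   {"name": ..., "type": "record", "fields": [...]}
# instead of
#   {"name": ..., "type": {"type": "record", "name": ..., "fields": [...]}}
COMPLEX_TYPES = {"record", "array", "map", "enum", "fixed"}

def find_inline_complex_fields(schema, path=""):
    """Return dotted paths of fields whose "type" is a bare complex-type keyword."""
    problems = []
    if isinstance(schema, list):  # a union: check every branch
        for branch in schema:
            problems += find_inline_complex_fields(branch, path)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for field in schema.get("fields", []):
                field_path = f"{path}.{field['name']}" if path else field["name"]
                ftype = field.get("type")
                if isinstance(ftype, str) and ftype in COMPLEX_TYPES:
                    problems.append(field_path)
                else:
                    problems += find_inline_complex_fields(ftype, field_path)
        elif kind == "array":
            problems += find_inline_complex_fields(schema.get("items"), path)
        elif kind == "map":
            problems += find_inline_complex_fields(schema.get("values"), path)
    return problems

# Shortened version of the first (failing) schema from the question.
FIRST = json.loads("""
{"name": "myRecord", "type": "record", "fields": [
  {"name": "name_tag", "type": "string", "default": ""},
  {"name": "known_nested_structure", "type": "record", "fields": [
    {"name": "fieldA", "type": {"type": "array", "items": "string"}}
  ]},
  {"name": "another_field", "type": "string", "default": ""}
]}
""")

# Shortened version of the second (working) schema from the question.
SECOND = json.loads("""
{"name": "myRecord", "type": "record", "fields": [
  {"name": "name_tag", "type": "string"},
  {"name": "known_nested_structure", "type": {
    "name": "known_nested_structure", "type": "record", "fields": [
      {"name": "fieldA", "type": {"type": "array", "items": "string"}}
    ]}},
  {"name": "another_field", "type": "string"}
]}
""")

print(find_inline_complex_fields(FIRST))   # ['known_nested_structure']
print(find_inline_complex_fields(SECOND))  # []
```

The check only covers this one mistake; a schema that passes it can still be rejected by Avro for other reasons.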

1

There is one error I can see in your schema (a stray c before "default"):

{"name":"name_tag", "type":"string",c"default": ""}, 

It should be:

{"name":"name_tag", "type":"string","default": ""}, 
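A typo like this can be caught before the schema literal ever reaches Hive by running it through any JSON parser. The sketch below uses Python's standard json module; `is_valid_json` is a hypothetical helper name, and the check only confirms that the text is well-formed JSON, not that it is a valid Avro schema.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# The field as posted in the question, with the stray "c".
bad = '{"name":"name_tag", "type":"string",c"default": ""}'
# The corrected field.
good = '{"name":"name_tag", "type":"string","default": ""}'

print(is_valid_json(bad))   # False
print(is_valid_json(good))  # True
```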
+0

Thanks, that probably happened while copying and pasting. It is not in the code :)