Creating an Avro schema for a simple JSON with a nested record
I'm trying to create an Avro schema (for Hadoop) for the following JSON:
{
"name_tag":"Guy",
"known_nested_structure" : {
"fieldA" : ["value1"],
"fieldB" : ["value1","value2"],
"fieldC" : [],
"fieldD" : ["value1"]
},
"another_field" : "hi"
}
My first attempt was this Avro schema (shown together with the Hive command):
CREATE EXTERNAL TABLE IF NOT EXISTS record_table
PARTITIONED BY (YEAR INT, MONTH INT, DAY INT, HOUR INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://localhost/data/output/records_data/hourly'
TBLPROPERTIES ('avro.schema.literal'='{
"name": "myRecord",
"type": "record",
"fields": [
{"name":"name_tag", "type":"string","default": ""},
{
"name": "known_nested_structure",
"type": "record",
"fields": [
{"name":"fieldA", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldB", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldC", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldD", "type":{"type":"array","items":"string"},"default":null}
],
"default":null
},
{"name": "another_field","type":"string","default": ""}
]
}');
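Hive's result for this command:
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
year int
month int
day int
hour int
Time taken: 0.128 seconds
But for some reason, the following Avro schema works: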
{
"name": "myRecord",
"type": "record",
"fields": [
{"name":"name_tag", "type":"string","default": null},
{
"name": "known_nested_structure",
"type": {
"name": "known_nested_structure",
"type": "record",
"fields": [
{"name":"fieldA", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldB", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldC", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldD", "type":{"type":"array","items":"string"},"default":null}
],
"default":null
}
},
{"name": "another_field","type": "string","default": null}
]
}
Result:
OK
name_tag string from deserializer
known_nested_structure struct<fielda:array<string>,fieldb:array<string>,fieldc:array<string>,fieldd:array<string>> from deserializer
another_field string from deserializer
year int
month int
day int
hour int
Time taken: 0.123 seconds
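To narrow it down, the only structural difference between the two schemas is how the known_nested_structure field declares its type. Below is a minimal sketch in plain Python (standard library only) that classifies this for a trimmed-down copy of each field. The helper classify_field_type is hypothetical and is not part of Avro or Hive; it only reflects the rule in the Avro specification that a schema is a JSON string naming a type, a JSON object defining a type, or a JSON array for a union.

```python
import json

# Primitive type names listed in the Avro specification.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
# Complex types, which have to be defined with a JSON object.
COMPLEX = {"record", "enum", "array", "map", "fixed"}


def classify_field_type(field):
    """Describe how a field declares its type (hypothetical helper)."""
    t = field["type"]
    if isinstance(t, dict):
        return "inline " + t["type"] + " definition"
    if isinstance(t, list):
        return "union"
    if t in PRIMITIVES:
        return "primitive " + t
    if t in COMPLEX:
        return "bare complex type name '" + t + "' (no definition object)"
    return "reference to named type " + t


# Trimmed-down field from the first schema: "record" is given as a plain string.
first = json.loads("""
{"name": "known_nested_structure", "type": "record",
 "fields": [{"name": "fieldA", "type": {"type": "array", "items": "string"}}]}
""")

# Trimmed-down field from the second schema: the record is an object under "type".
second = json.loads("""
{"name": "known_nested_structure",
 "type": {"name": "known_nested_structure", "type": "record",
          "fields": [{"name": "fieldA", "type": {"type": "array", "items": "string"}}]}}
""")

first_kind = classify_field_type(first)
second_kind = classify_field_type(second)
print("first schema: ", first_kind)
print("second schema:", second_kind)
```

This only inspects the JSON shape; it does not run the real Avro parser, so it shows where the two schemas differ rather than reproducing Hive's behavior.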
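Why does the first Avro schema not work? Why can't I declare the record directly as the field, instead of nesting a known_nested_structure record definition inside the known_nested_structure field as in my second schema?
Thanks,
Guy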
Thanks, that was probably introduced by the copy and paste. It's not in the code :)