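I have an AWS IoT rule that sends incoming JSON to Kinesis Firehose. How can I filter out multi-line JSON data before it goes into an AWS Hive table?
The JSON data published from my IoT device is all on one line, for example: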
{"count":4950, "dateTime8601": "2017-03-09T17:15:28.314Z"}
The "Test" section of the AWS IoT console lets you publish a message, which defaults to the following (note the formatted, multi-line JSON):
{
"message": "Hello from AWS IoT console"
}
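I stream Firehose to S3, where the data is then converted by EMR to a columnar format to be ultimately used by Athena.
The problem is that, during the conversion to columnar format, Hive (specifically the JSON SerDe) cannot handle a JSON object that spans more than one line. It blows up the whole conversion, so the good, single-line JSON records are not converted either.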
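To make the failure mode concrete, here is a minimal Python sketch (not Hive itself) of a line-oriented reader, which is roughly how a text-based JSON SerDe sees the file: every physical line is treated as one record. The helper name `parse_line_by_line` and the sample string are made up for illustration.

```python
import json

def parse_line_by_line(text):
    """Treat each physical line as one JSON record, as a line-oriented reader would."""
    good, bad = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(line)  # a fragment of a multi-line object ends up here
    return good, bad

# One single-line device record followed by the multi-line console test message.
sample = (
    '{"count":4950, "dateTime8601": "2017-03-09T17:15:28.314Z"}\n'
    '{\n'
    '"message": "Hello from AWS IoT console"\n'
    '}\n'
)

good, bad = parse_line_by_line(sample)
print(len(good), "parsed;", len(bad), "rejected")
```

The device record parses, while each of the three lines of the console message is rejected on its own, because none of them is a complete JSON document.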
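My questions are:
- How do you set up Firehose to ignore multi-line JSON?
- If that is not possible, how do you tell Hive to remove newlines before loading into the table, or at least catch exceptions and try to continue?
I am already trying to ignore malformed JSON when defining the Hive table: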
DROP TABLE site_sensor_data_raw;
CREATE EXTERNAL TABLE site_sensor_data_raw (
count int,
dateTime8601 timestamp
)
PARTITIONED BY(year int, month int, day int, hour int)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties (
'ignore.malformed.json' = 'true',
"timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',millis"
)
LOCATION 's3://...';
Here is my full HQL that does the conversion:
--Example of converting to ORC/columnar formats
DROP TABLE site_sensor_data_raw;
CREATE EXTERNAL TABLE site_sensor_data_raw (
count int,
dateTime8601 timestamp
)
PARTITIONED BY(year int, month int, day int, hour int)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties (
'ignore.malformed.json' = 'true',
"timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',millis"
)
LOCATION 's3://bucket.me.com/raw/all-sites/';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='15') location 's3://bucket.me.com/raw/all-sites/2017/03/09/15';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='16') location 's3://bucket.me.com/raw/all-sites/2017/03/09/16';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='17') location 's3://bucket.me.com/raw/all-sites/2017/03/09/17';
DROP TABLE to_orc;
CREATE EXTERNAL TABLE to_orc (
count int,
dateTime8601 timestamp
)
STORED AS ORC
LOCATION 's3://bucket.me.com/orc'
TBLPROPERTIES ("orc.compress"="ZLIB");
INSERT OVERWRITE TABLE to_orc SELECT count,dateTime8601 FROM site_sensor_data_raw where year=2017 AND month=03 AND day=09 AND hour=15;
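For the first question, one direction I am considering is a Firehose data-transformation Lambda that re-serializes every record onto a single line before it reaches S3. The sketch below is untested against a real delivery stream and assumes that each Firehose record carries exactly one JSON object; the handler name `lambda_handler` is just the conventional one. It relies on the transformation contract in which each returned record has a `recordId`, a `result` of `Ok`, `Dropped` or `ProcessingFailed`, and base64-encoded `data`.

```python
import base64
import json

def lambda_handler(event, context):
    """Re-emit each Firehose record as compact, single-line JSON."""
    output = []
    for record in event["records"]:
        try:
            raw = base64.b64decode(record["data"]).decode("utf-8")
            obj = json.loads(raw)  # accepts multi-line JSON as well
        except ValueError:
            # Not decodable or not valid JSON: drop it so it never reaches S3.
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue

        # Compact serialization never contains raw newlines; add one as the
        # record delimiter so each object lands on its own line in S3.
        one_line = json.dumps(obj, separators=(",", ":")) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(one_line.encode("utf-8")).decode("ascii"),
        })
    return {"records": output}
```

Note that this only collapses the console test message onto one line; it would still arrive in the Hive table as a row with null `count` and `dateTime8601`, so records lacking the expected keys could be marked `Dropped` as well if they should be filtered out entirely.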