我遵循Athena getting started guide並嘗試解析我自己的Cloudfront日誌。但是,這些字段沒有被解析。亞馬遜雅典娜不解析雲端日誌
我用一個小測試文件,內容如下:
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /foo 404 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Error -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA== d3g47gpj5mj0b.cloudfront.net https421 0.076 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
2016-02-02 07:57:45 LHR5 1158241 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Miss oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA== d3g47gpj5mj0b.cloudfront.net https 419 0.440 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss
而與此SQL創建的表:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://test/athena-csv/'
但沒有數據回來:
我可以看到它返回4行,但前兩個應該排除,因爲他們sta rt與#,所以這就像正則表達式不被正確解析。
我做錯了什麼?或者是正則表達式錯誤(似乎不太可能,因爲它在文檔中,對我來說看起來很好)?
就像一個魅力。他們添加了一個新的GUI,所以它是相同的,期望他們現在有一個嚮導,允許你粘貼列的列表,如下所示: 'date date,time string,location string,bytes int,request_ip string,method string,主機字符串uri字符串狀態int引用字符串useragent字符串uri_query字符串cookie字符串edge_type字符串edget_requiest_id字符串host_header字符串cs_protocol字符串cs_bytes int time_taken字符串x_forwarded_for字符串ssl_protocol字符串ssl_cipher字符串result_type字符串,協議字符串' –