2017-03-04 134 views
6

Amazon Athena not parsing CloudFront logs

I followed the Athena getting started guide and tried to parse my own CloudFront logs. However, the fields are not being parsed.

I used a small test file with the following contents:

#Version: 1.0 
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type 
2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /foo 404 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Error -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA== d3g47gpj5mj0b.cloudfront.net https 421 0.076 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error 
2016-02-02 07:57:45 LHR5 1158241 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Miss oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA== d3g47gpj5mj0b.cloudfront.net https 419 0.440 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss 

And I created the table with this SQL:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
    `Date` DATE, 
    Time STRING, 
    Location STRING, 
    Bytes INT, 
    RequestIP STRING, 
    Method STRING, 
    Host STRING, 
    Uri STRING, 
    Status INT, 
    Referrer STRING, 
    os STRING, 
    Browser STRING, 
    BrowserVersion STRING 
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES (
    "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$" 
) LOCATION 's3://test/athena-csv/' 

But no data comes back:

[Screenshot: Athena query results with no data]

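I can see that it returns 4 rows, but the first two should be excluded because they start with #, so it looks as if the regex is not being applied correctly.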

What am I doing wrong? Or is the regex wrong (which seems unlikely, since it comes from the documentation and looks fine to me)?
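One way to check the regex outside Athena is to run it locally. The sketch below is Python rather than the Java regex engine that RegexSerDe uses, so treat it as an approximation. It applies the regex from the question (with the `\\s` escaping reduced to `\s`) to the start of the first data row of the sample file; the helper name `parse` and the truncation of the row after the user-agent field are assumptions made for illustration. It suggests that the data rows fail because the sample user agent contains `%2520` (a doubly encoded space) where the regex expects a literal `%20`.

```python
import re
from urllib.parse import unquote

# Regex from the CREATE TABLE statement: 10 space-separated fields, then the
# user-agent part split into os / browser / browser version.
PATTERN = re.compile(
    r"^(?!#)" + r"([^ ]+)\s+" * 10
    + r"[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
)

# User agent exactly as it appears in the first sample row.
UA = ("Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)"
      "%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)"
      "%2520Chrome/47.0.2526.111%2520Safari/537.36")

# First sample row, truncated after the user-agent field (assumption).
ROW = " ".join(["2016-02-02", "07:57:45", "LHR5", "5001", "86.177.253.38",
                "GET", "d3g47gpj5mj0b.cloudfront.net", "/foo", "404", "-", UA])


def parse(line):
    """Return the captured groups, or None when the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None


as_logged = parse(ROW)              # "%2520" never contains a literal "%20"
decoded_once = parse(unquote(ROW))  # one URL-decode turns "%2520" into "%20"
header = parse("#Version: 1.0")     # rejected by the (?!#) lookahead

print(as_logged, header)
print(decoded_once[10:] if decoded_once else None)
```

If the same holds in Athena, none of the four lines matches, which would fit the four empty rows in the result.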

Answers

8

This is what I ended up with:

CREATE EXTERNAL TABLE logs (
    `date` date, 
    `time` string, 
    `location` string, 
    `bytes` int, 
    `request_ip` string, 
    `method` string, 
    `host` string, 
    `uri` string, 
    `status` int, 
    `referer` string, 
    `useragent` string, 
    `uri_query` string, 
    `cookie` string, 
    `edge_type` string, 
    `edget_requiest_id` string, 
    `host_header` string, 
    `cs_protocol` string, 
    `cs_bytes` int, 
    `time_taken` string, 
    `x_forwarded_for` string, 
    `ssl_protocol` string, 
    `ssl_cipher` string, 
    `result_type` string, 
    `protocol` string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
    'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)' 
) LOCATION 's3://logs' 

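Note that the double backslashes are intentional.

The CloudFront log format changed at some point to add a field (presumably the trailing protocol column above, which the regex treats as optional). This handles both older and newer files.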
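The optional last group can be tried locally. The following Python sketch rebuilds a regex of the same shape (23 mandatory fields plus one optional trailing field, assumed from the 24 columns of the table) and runs it on a 23-field row and a 24-field row. The shortened user agent and the `HTTP/1.1` value for the extra field are hypothetical, and `fullmatch` is used on the assumption that the SerDe requires the whole line to match.

```python
import re

# 23 mandatory fields followed by one optional trailing field.
PATTERN = re.compile(
    r"^(?!#.*)" + r"\s+".join([r"(\S+)"] * 23) + r"\s*(\S*)"
)

# Second sample row from the question, with the user agent shortened.
OLD_FIELDS = [
    "2016-02-02", "07:57:45", "LHR5", "1158241", "86.177.253.38", "GET",
    "d3g47gpj5mj0b.cloudfront.net", "/images/posts/cover/404.jpg", "200",
    "https://d3g47gpj5mj0b.cloudfront.net/foo", "Mozilla/5.0", "-", "-",
    "Miss", "oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA==",
    "d3g47gpj5mj0b.cloudfront.net", "https", "419", "0.440", "-",
    "TLSv1.2", "ECDHE-RSA-AES128-GCM-SHA256", "Miss",
]
# Hypothetical newer row with one extra trailing field.
NEW_FIELDS = OLD_FIELDS + ["HTTP/1.1"]

old = PATTERN.fullmatch("\t".join(OLD_FIELDS))
new = PATTERN.fullmatch("\t".join(NEW_FIELDS))

print(repr(old.group(24)))  # empty string for the older format
print(repr(new.group(24)))  # value of the extra field for the newer format
```

Because the last group is `(\S*)` rather than `(\S+)`, older rows still match and simply leave the final column empty.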

+0

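Works like a charm. They have added a new GUI, so it is the same, except that they now have a wizard that lets you paste a list of columns, like this: 'date date, time string, location string, bytes int, request_ip string, method string, host string, uri string, status int, referer string, useragent string, uri_query string, cookie string, edge_type string, edget_requiest_id string, host_header string, cs_protocol string, cs_bytes int, time_taken string, x_forwarded_for string, ssl_protocol string, ssl_cipher string, result_type string, protocol string'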

0

The demo did not work for me either. After playing around a bit, I got the following to work:

CREATE EXTERNAL TABLE IF NOT EXISTS DBNAME.TABLENAME (
    `date` date, 
    `time` string, 
    `location` string, 
    `bytes` int, 
    `request_ip` string, 
    `method` string, 
    `host` string, 
    `uri` string, 
    `status` int, 
    `referer` string, 
    `useragent` string, 
    `uri_query` string, 
    `cookie` string, 
    `edge_type` string, 
    `edget_requiest_id` string, 
    `host_header` string, 
    `cs_protocol` string, 
    `cs_bytes` int, 
    `time_taken` string, 
    `x_forwarded_for` string, 
    `ssl_protocol` string, 
    `ssl_cipher` string, 
    `result_type` string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
    'serialization.format' = '1', 
    'input.regex' = '^(?!#.*)(?!#.*)([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$' 
) LOCATION 's3://bucket/logs/'; 

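Replace bucket/logs and DBNAME.TABLENAME with your own values. For some reason it still inserts empty rows for the # lines, but I get the rest of the data.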
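A likely explanation for the empty rows is that the SerDe emits one row per input line and fills every column with NULL when the regex does not match, rather than dropping the line. The Python sketch below mimics that assumed behaviour. The pattern is a simplified one of the same shape as the regex above, `N_COLUMNS = 23` is taken from the column list of the table, and the data row uses placeholder values.

```python
import re

N_COLUMNS = 23  # assumption: one capture group per column in the table above
PATTERN = re.compile(
    r"^(?!#.*)" + r"\s+".join([r"([^\s]+)"] * N_COLUMNS) + r"$"
)


def deserialize(line):
    """Mimic the assumed SerDe behaviour: a non-matching line still yields a row."""
    m = PATTERN.fullmatch(line)
    return list(m.groups()) if m else [None] * N_COLUMNS


lines = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location",  # header shortened for illustration
    "\t".join(f"f{i}" for i in range(1, N_COLUMNS + 1)),  # placeholder data row
]

rows = [deserialize(line) for line in lines]
empty_rows = [r for r in rows if all(v is None for v in r)]
kept = [r for r in rows if any(v is not None for v in r)]

print(len(rows), len(empty_rows), len(kept))
```

Under that assumption the lookahead only stops the # lines from being parsed; getting rid of the resulting empty rows would need a filter at query time, for example on a column that is never empty in real rows.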

I think the next step is to try to write a regex for the user agent or the cookie.

+0

I had to tweak it to get it working with yours, but it should be fine now. Note: \\s should be \s if you have problems copying and pasting. – CoderDan

1

I was pulling my hair out over this; here is an improvement on @CoderDan's answer:

The trick is to use \t as the separator in the regex rather than \s.

CREATE EXTERNAL TABLE IF NOT EXISTS mytablename (
    `date` date, 
    `time` string, 
    `location` string, 
    `bytes` int, 
    `request_ip` string, 
    `method` string, 
    `host` string, 
    `uri` string, 
    `status` int, 
    `referer` string, 
    `useragent` string, 
    `uri_query` string, 
    `cookie` string, 
    `edge_type` string, 
    `edget_request_id` string, 
    `host_header` string, 
    `cs_protocol` string, 
    `cs_bytes` int, 
    `time_taken` int, 
    `x_forwarded_for` string, 
    `ssl_protocol` string, 
    `ssl_cipher` string, 
    `result_type` string, 
    `protocol_version` string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
    'serialization.format' = '1', 
    'input.regex' = '^(?!#.*)(?!#.*)([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)$' 
) LOCATION 's3://mybucket/myprefix/'; 
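To see how a tab-only pattern and a \s pattern differ at the regex level, the Python sketch below runs both against a tab-separated line and a space-separated copy of it. The field values are placeholders and `N = 24` is assumed from the column list above. It only exercises the regex engine; it says nothing about how Athena un-escapes backslashes in the DDL string, which may be the real source of differences people see.

```python
import re

N = 24  # assumption: 24 fields, one per column in the table above
TAB_ONLY = re.compile(r"^(?!#.*)" + r"\t+".join([r"([^\t]+)"] * N) + r"$")
ANY_SPACE = re.compile(r"^(?!#.*)" + r"\s+".join([r"(\S+)"] * N) + r"$")

fields = [f"f{i}" for i in range(1, N + 1)]  # placeholder field values
tab_line = "\t".join(fields)    # how the real log files are delimited
space_line = " ".join(fields)   # e.g. a sample pasted with spaces

tab_on_tab = bool(TAB_ONLY.fullmatch(tab_line))
tab_on_space = bool(TAB_ONLY.fullmatch(space_line))
space_on_tab = bool(ANY_SPACE.fullmatch(tab_line))
space_on_space = bool(ANY_SPACE.fullmatch(space_line))

print(tab_on_tab, tab_on_space, space_on_tab, space_on_space)
```

The tab-only pattern is the stricter of the two: it rejects a space-separated test file, while the \s pattern accepts both inputs.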
+0

Thanks Gregor. Actually '\s' works just as well as '\t', although both need two backslashes. Instead of '([^\t]+)', '([\S]+)' also works. – andrewrjones

+0

Really? For me \s did not work, but \t did. According to the spec, the format is tab-delimited.

2

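Actually, all the answers here have a small error: the 4th field must be BIGINT, not INT. Otherwise, requests for files larger than 2 GB are not parsed correctly. After a lengthy discussion with AWS Business Support, it looks like the correct format is: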

CREATE EXTERNAL TABLE your_table_name (
    `Date` DATE, 
    Time STRING, 
    Location STRING, 
    SCBytes BIGINT, 
    RequestIP STRING, 
    Method STRING, 
    Host STRING, 
    Uri STRING, 
    Status INT, 
    Referrer STRING, 
    UserAgent STRING, 
    UriQS STRING, 
    Cookie STRING, 
    ResultType STRING, 
    RequestId STRING, 
    HostHeader STRING, 
    Protocol STRING, 
    CSBytes BIGINT, 
    TimeTaken FLOAT, 
    XForwardFor STRING, 
    SSLProtocol STRING, 
    SSLCipher STRING, 
    ResponseResultType STRING, 
    CSProtocolVersion STRING 
) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
    LOCATION 's3://path_to_your_data_directory' 
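The INT versus BIGINT point comes down to the range of a 32-bit signed integer. The Python sketch below splits a tab-delimited line the way a delimited row format would and checks a byte count against both limits. The row is hypothetical (a 3 GiB response, first four fields only) and `parse_row` is an illustrative helper. Header lines are skipped by hand here; the DDL above contains no equivalent instruction.

```python
INT_MAX = 2**31 - 1     # largest value a 32-bit signed INT can hold
BIGINT_MAX = 2**63 - 1  # largest value a 64-bit signed BIGINT can hold


def parse_row(line):
    """Split one tab-delimited log line; return None for '#' header lines."""
    if line.startswith("#"):
        return None
    return line.rstrip("\n").split("\t")


# Hypothetical row for a 3 GiB response; only the first four fields are shown.
row = parse_row("2016-02-02\t07:57:45\tLHR5\t3221225472\n")
sc_bytes = int(row[3])  # 4th field: sc-bytes

fits_int = sc_bytes <= INT_MAX
fits_bigint = sc_bytes <= BIGINT_MAX

print(sc_bytes, fits_int, fits_bigint)
```

Any sc-bytes value above 2,147,483,647 falls outside INT, which is why the 4th column (and cs-bytes, for large uploads) is declared BIGINT.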
+1

Out of everything on this page, this is the only one that worked for me.