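I am looking for a regex that matches multi-line log records and can be fed to Hive's RegexSerDe in the form
"input.regex"="the regex goes here"
inside a "create external table" Hive QL statement. The logs in the files that the RegexSerDe must read have the following form: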
2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
This is a special one.
This has a message that is multi-lined.
This is line number 4 of the same log.
Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.
I am using the following code to create the external table:
CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';
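Here is the problem:
The table picks up the first line of every log entry, but not the remaining lines of entries that span more than one line. I tried every variation I could find: replacing \z with \Z at the end, replacing \A with ^, and replacing \Z or \z with $. Nothing worked. Am I missing something in the %4$s of output.format.string? Or am I not using the regex correctly?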
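What the regex is meant to do:
It matches the timestamp first, followed by the log type (DEBUG, INFO, or whatever), then the ID (a mix of lowercase letters, digits, and hyphens), followed by anything at all, until it finds the next timestamp or reaches the end of the input, which covers the last log entry. I also tried adding /m at the end, in which case the resulting table contains only NULL values.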
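To reproduce this outside Hive, here is a minimal Java sketch. It is a test harness under stated assumptions, not what RegexSerDe does internally: the class name LogRegexDemo, the constant TS, the shortened SAMPLE text, and the RECORD pattern are all hypothetical, and RECORD is a simplified variant of the idea described above rather than the input.regex from the table definition. It applies the same pattern twice, once to the whole text and once to each line separately, on the assumption that a line-oriented text reader hands over one line at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexDemo {
    // Timestamp such as 2013-02-12 12:03:24,946
    static final String TS = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}";

    // Hypothetical test pattern (NOT the input.regex from the table):
    // timestamp, [LEVEL], request id, then a lazy message that stops
    // right before the next line starting with a timestamp, or at the
    // end of the input (ignoring one trailing newline).
    static final Pattern RECORD = Pattern.compile(
        "^(" + TS + ") (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*?)(?=\\n" + TS + " |\\n?\\z)",
        Pattern.MULTILINE | Pattern.DOTALL);

    // Shortened sample: one three-line entry followed by a one-line entry.
    static final String SAMPLE =
        "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b First line.\n"
        + "This is a special one.\n"
        + "Line 3.\n"
        + "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n";

    static String[] groups(Matcher m) {
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    // Scan the whole text at once, so a message may span several lines.
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            rows.add(groups(m));
        }
        return rows;
    }

    // Match each line on its own (assumed line-at-a-time input).
    // Continuation lines have no timestamp, so they never match.
    static List<String[]> parseLineByLine(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\n")) {
            Matcher m = RECORD.matcher(line);
            if (m.matches()) {
                rows.add(groups(m));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : parse(SAMPLE)) {
            System.out.println("whole text:   " + String.join(" | ", row));
        }
        for (String[] row : parseLineByLine(SAMPLE)) {
            System.out.println("line by line: " + String.join(" | ", row));
        }
    }
}
```

On this sample, the whole-text scan keeps all three lines of the first entry in the message group, while the line-by-line pass keeps only "First line." for it. That matches the symptom described above, so the way the input is split before the regex sees it may matter as much as the anchors do.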
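Why don't you array that baby? (lol, that's not even a verb, but still... can't you put every entry into an array? Then the first line would be key 0, the second, multi-line item would be at 1, and the other two at 2 and 3, and you can call them whatever you like) – user1576978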