2015-01-17 70 views
0

How can I generate a regular expression to match access.log based on its format config?

The access.log format config may look like this:

'$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"' 

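Is there a way to generate a regular expression from it that matches the lines in access.log? I can write a regex by hand based on a real log entry like this one: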

'112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1" 206 546849 "http://example.com/video/302/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"' 

But I can't work out how to derive the regex from the format config. Can anyone help?

The simplest option is to replace each `$` followed by letters with '(.+?)'. – georg

@georg, but how would you know which group is remote_addr and which is time_local? I need to tell them apart. – Jerry

You can use named groups for that: '(?P<remote_addr>.+?) - (?P<remote_user>.+?)' etc. – georg

Answer

6

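To build an expression from the config, replace each config variable such as $xxx with a named group such as (?P<xxx>.*?), and escape the delimiters: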

import re 

conf = '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"' 
regex = ''.join(
    '(?P<' + g + '>.*?)' if g else re.escape(c) 
    for g, c in re.findall(r'\$(\w+)|(.)', conf)) 

Now, if you match a log entry against this expression:

log = '112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1" 206 546849 "http://example.com/video/302/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"' 
m = re.match(regex, log) 

your variables are captured in the match object's groupdict():

import pprint 
pprint.pprint(m.groupdict()) 

Result:

{'body_bytes_sent': '546849', 
'http_referer': 'http://example.com/video/302/', 
'http_user_agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36', 
'remote_addr': '112.3.194.120', 
'remote_user': '-', 
'request': 'GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1', 
'status': '206', 
'time_local': '17/Jan/2015:20:07:34 +0800'} 
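One caveat with the lazy `.*?` groups: `re.match` only anchors at the start of the string, so if a format ends with a variable rather than a delimiter, the last group would come back empty. A minimal sketch that wraps the steps above in a function and anchors both ends with `fullmatch`; the names `build_log_regex` and `parse_line` are hypothetical helpers invented for this illustration, and the shortened format and log line are adapted from the question:

```python
import re


def build_log_regex(conf):
    """Compile a log format string into a regex with one named group per $variable."""
    parts = []
    for var, char in re.findall(r'\$(\w+)|(.)', conf):
        # A $variable becomes a lazy named group; any other character is a literal.
        parts.append('(?P<%s>.*?)' % var if var else re.escape(char))
    return re.compile(''.join(parts))


def parse_line(regex, line):
    """Return a dict of captured fields, or None if the line does not fit the format."""
    m = regex.fullmatch(line.rstrip('\n'))
    return m.groupdict() if m else None


# Shortened format that ends with a variable instead of a closing quote.
conf = '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent'
regex = build_log_regex(conf)
line = '112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET / HTTP/1.1" 206 546849\n'
fields = parse_line(regex, line)
print(fields['status'], fields['body_bytes_sent'])
```

Returning `None` for lines that do not fit makes it easy to skip or count malformed entries when looping over a whole file.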

If your log config has no delimiters, you will have to use more specific sub-patterns than just .*. This can be coded elegantly along these lines:

# variable-specific patterns 
patterns = { 
    'remote_addr': r'(\d{1,3}\.){3}\d{1,3}', 
    'body_bytes_sent': r'\d+', 
    # etc 
} 

regex = ''.join(
    '(?P<%s>%s)' % (g, patterns.get(g, '.*?')) if g 
     else re.escape(c) 
    for g, c in re.findall(r'\$(\w+)|(.)', conf)) 
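
To illustrate the space-separated case end to end, here is a sketch with a fuller pattern table. The sub-patterns are assumptions about what each field usually looks like (for example, that `$request` has the form `METHOD URI HTTP/x.y` and that `$remote_user` and `$http_referer` contain no spaces), and the sample line is the question's entry with the delimiters removed and shortened, so adjust both to your data:

```python
import re

# Assumed shapes for each variable; illustrative, not exhaustive.
patterns = {
    'remote_addr': r'(?:\d{1,3}\.){3}\d{1,3}',
    'remote_user': r'\S+',
    'time_local': r'\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}',
    'request': r'[A-Z]+ \S+ HTTP/\d(?:\.\d)?',
    'status': r'\d{3}',
    'body_bytes_sent': r'\d+',
    'http_referer': r'\S+',
    'http_user_agent': r'.*',  # last field: take the rest of the line
}

conf = ('$remote_addr $remote_user $time_local $request $status '
        '$body_bytes_sent $http_referer $http_user_agent')

# Unknown variables fall back to "no whitespace" here (an assumption of this sketch).
regex = re.compile(''.join(
    '(?P<%s>%s)' % (g, patterns.get(g, r'\S+')) if g else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf)))

log = ('112.3.194.120 - 17/Jan/2015:20:07:34 +0800 GET /video/302/ HTTP/1.1 '
       '206 546849 http://example.com/video/302/ Mozilla/5.0 (Windows NT 5.1)')

m = regex.fullmatch(log)
print(m.group('time_local'))
print(m.group('http_user_agent'))
```

Because fields such as the timestamp and the user agent contain spaces themselves, only field-specific patterns can tell where one value ends and the next begins.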

Thanks, this helps a lot. But I still wonder: what if the log format config looks like '$remote_addr $remote_user $time_local $request $status $body_bytes_sent $http_referer $http_user_agent'? Can a regex still help? – Jerry

Then you have to be more specific than '.*?'. Use a dedicated expression for each config variable, e.g. '(\d{1,3}\.){3}\d{1,3}' for the IP address or '\d+' for bytes_sent. – georg

Added an example. – georg