To build the expression from the config, replace variables like $xxx with named groups like (?P<xxx>.*?) and escape the delimiters:
import re
conf = '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"'
regex = ''.join(
    '(?P<' + g + '>.*?)' if g else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf))
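To make the tokenizing step more concrete, here is a minimal sketch on a shortened, made-up config (short_conf, tokens and short_regex are names chosen for illustration only). re.findall returns one (variable, literal) pair per token, with exactly one side non-empty, and that is what the join iterates over:

```python
import re

# Shortened, hypothetical config; it is not the full format used above.
short_conf = '$status "$request"'

# Each token is either a variable name (first slot) or a single
# literal character (second slot); the other slot is an empty string.
tokens = re.findall(r'\$(\w+)|(.)', short_conf)

# Variables become lazy named groups, literals are escaped.
short_regex = ''.join(
    '(?P<' + g + '>.*?)' if g else re.escape(c)
    for g, c in tokens)

short_match = re.match(short_regex, '200 "GET / HTTP/1.1"')
print(tokens)
print(short_match.groupdict())
```

The exact text of short_regex depends on how your Python version's re.escape treats characters such as the double quote, so it is safer to check the match result than the pattern string.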
Now, if you match this expression against a log entry:
log = '112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1" 206 546849 "http://example.com/video/302/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
m = re.match(regex, log)
your variables are captured in matchObject.groupdict():
import pprint
pprint.pprint(m.groupdict())
Result:
{'body_bytes_sent': '546849',
 'http_referer': 'http://example.com/video/302/',
 'http_user_agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
 'remote_addr': '112.3.194.120',
 'remote_user': '-',
 'request': 'GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1',
 'status': '206',
 'time_local': '17/Jan/2015:20:07:34 +0800'}
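For more than one entry, the same idea can be wrapped up roughly as follows. This is only a sketch: build_regex and parse_lines are hypothetical helper names, the sample config is shortened, and using fullmatch (so a line must fit the format completely) is my assumption about what you want for malformed lines.

```python
import re

def build_regex(conf):
    # Hypothetical helper: same construction as above, wrapped in a function.
    return ''.join(
        '(?P<' + g + '>.*?)' if g else re.escape(c)
        for g, c in re.findall(r'\$(\w+)|(.)', conf))

def parse_lines(conf, lines):
    # Compile once, then reuse the pattern for every line.
    pattern = re.compile(build_regex(conf))
    for line in lines:
        m = pattern.fullmatch(line.rstrip('\n'))
        if m:  # lines that do not fit the format are skipped
            yield m.groupdict()

# Shortened, made-up config and input, for illustration only.
sample_conf = '$remote_addr - $remote_user [$time_local] $status'
sample_lines = [
    '112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] 206\n',
    'this line is not an access log entry\n',
]
entries = list(parse_lines(sample_conf, sample_lines))
print(entries)
```

Since parse_lines is a generator, you can pass it an open file object instead of a list and the log is processed line by line.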
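If there are no delimiters in your log config, you will have to use more specific subpatterns rather than just .*?. This can be encoded elegantly in a way similar to this: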
# variable-specific patterns; variables not listed here fall back to .*?
patterns = {
    # non-capturing inner group, so no extra unnamed group is created
    'remote_addr': r'(?:\d{1,3}\.){3}\d{1,3}',
    'body_bytes_sent': r'\d+',
    # etc
}
regex = ''.join(
    '(?P<%s>%s)' % (g, patterns.get(g, '.*?')) if g
    else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf))
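One case where the specific patterns visibly matter is a variable with nothing after it. With re.match, a trailing lazy .*? is allowed to match the empty string, while \d+ has to consume the digits. A small sketch, assuming a made-up three-variable config and a hypothetical build_regex helper that takes the patterns dict as an argument:

```python
import re

def build_regex(conf, patterns=None):
    # Hypothetical helper; variables without an entry fall back to '.*?'.
    patterns = patterns or {}
    return ''.join(
        '(?P<%s>%s)' % (g, patterns.get(g, '.*?')) if g else re.escape(c)
        for g, c in re.findall(r'\$(\w+)|(.)', conf))

# Made-up config whose last variable has no delimiter after it.
short_conf = '$remote_addr $status $body_bytes_sent'
line = '112.3.194.120 206 546849'

# All groups lazy: the last one is satisfied by an empty match.
generic = re.match(build_regex(short_conf), line).groupdict()

# A specific pattern for the last variable captures the digits.
specific = re.match(
    build_regex(short_conf, {'body_bytes_sent': r'\d+'}), line).groupdict()

print(generic['body_bytes_sent'])   # empty string
print(specific['body_bytes_sent'])  # 546849
```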
The simplest option would be to replace each $ followed by letters with '(.+?)'. – georg
@georg, but how can you tell which one is remote_addr and which one is time_local? I need to sort them out. – Jerry
You can use named groups for that: '(?P<...>.+?) - (?P<...>.+?)' etc. – georg