Python - single-line vs. multi-line regex

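Goal: capture each process report's timestamp, e.g. 2011-09-21 15:45:00, together with the first two numbers on the "succ. statistics" line, from input such as: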

input_text = ''' 
# Process_Name  (23387) Report at 2011-09-21 15:45:00.001 Type: Periodic #\n 
some line 1\n 
some line 2\n 
some other lines\n 
succ. statistics |  1438  1439 99 | 3782245 3797376 99 |\n 
some lines\n 
Process_Name  (23387) Report at 2011-09-21 15:50:00.001 Type: Periodic #\n 
some line 1\n 
some line 2\n 
some other lines\n 
succ. statistics |  1436  1440 99 | 3782459 3797523 99 |\n 
repeat the pattern several hundred times... 
'''

Iterating over the file line by line works:

import re

def parse_file(file_handler, patterns):
    """Return the captured groups of every line that matches one of the patterns."""
    results = []
    for line in file_handler:
        for pattern in patterns.values():
            result = pattern.match(line)
            if result:
                results.append(result.groups())
    return results

patterns = {
    'report_date_time': re.compile(r'^# Process_Name\s*\(\s*\d+\) Report at (.*)\.[0-9]{3}\s+Type:\s*Periodic\s*#\s*.*$'),
    'serv_term_stats': re.compile(r'^succ\. statistics \|\s+(\d+)\s+(\d+)\s+\d+\s+\|\s+\d+\s+\d+\s+\d+\s+\|\s*$'),
    }
results = parse_file(fh, patterns)

[('2011-09-21 15:40:00',), 
('1425', '1428'), 
('2011-09-21 15:45:00',), 
('1438', '1439')]

but the timestamp and the statistics come back as separate tuples in the list. What I am aiming for is:

[('2011-09-21 15:40:00','1425', '1428'), 
('2011-09-21 15:45:00', '1438', '1439')]

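I tried several combinations of the two initial patterns with a lazy quantifier between them, but I can't figure out how to capture both with a single multi-line regex: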

# .*? Lazy quantifier: "match as few characters as possible (all characters allowed) until reaching the next expression"
pattern = r'# Process_Name\s*\(\s*\d+\) Report at (.*)\.[0-9]{3}\s+Type:\s*Periodic.*?succ\. statistics \|\s+(\d+)\s+(\d+)\s+\d+\s+\|\s+\d+\s+\d+\s+\d+\s+\|\s'
regex = re.compile(pattern, flags=re.MULTILINE)

data = file_handler.read()  
for match in regex.finditer(data): 
    results = match.groups()

How can I do this?

Source

2011-09-22 Joao Figueiredo

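I don't have an answer for you, but why are you embedding \n in a multi-line string like that? The actual line breaks in the string already are newlines. – geoffspear

Right, Wooble, this is on Linux, so I only added them to mark the line breaks (trying to avoid the usual "is it \n, \r, or \r\n?" question).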

Use re.DOTALL so that . matches any character, including newlines:

import re 

data = ''' 
# Process_Name  (23387) Report at 2011-09-21 15:45:00.001 Type: Periodic #\n 
some line 1\n 
some line 2\n 
some other lines\n 
succ. statistics |  1438  1439 99 | 3782245 3797376 99 |\n 
some lines\n 
repeat the pattern several hundred times... 
''' 

pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?succ. statistics\s+\|\s+(\d+)\s+(\d+)' 
regex = re.compile(pattern, flags=re.MULTILINE|re.DOTALL) 

for match in regex.finditer(data): 
    results = match.groups() 
    print(results) 

    # ('2011-09-21 15:45:00', '1438', '1439')
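To get the list of tuples the question asks for, the same idea can be applied to input containing several reports. The following is only a minimal sketch: the sample data is a shortened copy of the question's input with real newlines, and re.findall is swapped in for the finditer loop because it returns one tuple of groups per match.

```python
import re

# Two reports in the question's format (shortened sample, real newlines only).
data = """\
# Process_Name  (23387) Report at 2011-09-21 15:45:00.001 Type: Periodic #
some line 1
some other lines
succ. statistics |  1438  1439 99 | 3782245 3797376 99 |
some lines
# Process_Name  (23387) Report at 2011-09-21 15:50:00.001 Type: Periodic #
some line 1
succ. statistics |  1436  1440 99 | 3782459 3797523 99 |
"""

pattern = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'    # report timestamp
    r'.*?'                                       # lazily skip the lines in between
    r'succ\. statistics\s+\|\s+(\d+)\s+(\d+)',   # first two statistics
    flags=re.DOTALL,
)

# With more than one group, findall returns a tuple of groups per match.
results = pattern.findall(data)
print(results)
# [('2011-09-21 15:45:00', '1438', '1439'), ('2011-09-21 15:50:00', '1436', '1440')]
```

re.MULTILINE is left out here because the pattern uses neither ^ nor $, so the flag would have no effect.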

Source

2011-09-22 15:31:24 unutbu

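Wow, you're fast. Thanks for the answer and the improvement, unutbu, and thanks to stackoverflow for gurus like you!

Edit: one small bump. I did need to make the quantifier non-greedy, otherwise the regex captures only the first timestamp and the last statistics line, skipping the thousands of lines in between. Hence: pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?succ. statistics\s+\|\s+(\d+)\s+(\d+)'

@JoaoFigueiredo: Ah, good point. Thanks for the correction. – unutbu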
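The greedy-versus-lazy point from the comments can be checked with a small experiment. This is an illustrative sketch only: the two-report sample and the names head, tail, greedy and lazy are made up for the demonstration.

```python
import re

# Shortened two-report sample, for illustration only.
data = (
    "Report at 2011-09-21 15:45:00.001\n"
    "succ. statistics |  1438  1439 99 |\n"
    "Report at 2011-09-21 15:50:00.001\n"
    "succ. statistics |  1436  1440 99 |\n"
)

head = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
tail = r'succ\. statistics\s+\|\s+(\d+)\s+(\d+)'

# Greedy .* runs to the end of the text and backtracks to the LAST statistics line.
greedy = re.findall(head + r'.*' + tail, data, flags=re.DOTALL)

# Lazy .*? stops at the FIRST statistics line after each timestamp.
lazy = re.findall(head + r'.*?' + tail, data, flags=re.DOTALL)

print(greedy)  # [('2011-09-21 15:45:00', '1436', '1440')]
print(lazy)    # [('2011-09-21 15:45:00', '1438', '1439'), ('2011-09-21 15:50:00', '1436', '1440')]
```

The greedy version pairs the first timestamp with the last statistics line and yields a single match, which is the behaviour described in the edit above.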
