How do I tokenize a file as a sequence of regular expressions in Python?

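I want to parse a file into a list of tokens. Each token comprises at least one line, but can span more. Each token matches a regular expression. I want to signal an error if the input is not a sequence of tokens (i.e. there must be no garbage leading, in between, or trailing). I'm not concerned about memory consumption, as the input files are relatively small.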

In Perl, I would use something like this (pseudocode):

$s = slurp_file();
while ($s ne '') {
    if ($s =~ s/^\nsection (\d)\n\n//p) {
        push (@r, ['SECTION ' . $1, ${^MATCH}]);
    } elsif ($s =~ s/^some line\n//p) {
        push (@r, ['SOME LINE', ${^MATCH}]);
    [...]
    } else {
        die ("Found garbage: " . Dumper ($s));
    }
}

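I can of course port this 1:1 to Python, but is there a more Pythonic way to do it? (I do not want to parse line by line and then build a hand-crafted state machine on top.)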
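For reference, a 1:1 port of the loop above might look roughly like the following sketch. The names RULES and tokenize are made up for illustration, and the two rules are just the ones from the Perl pseudocode. Instead of chopping the matched prefix off the string, it keeps a position and uses Pattern.match(text, pos), which only matches starting exactly at pos (so no leading ^ is needed).

```python
import re

# Hypothetical rule table: (compiled pattern, function building the token label).
# The patterns mirror the two rules from the Perl pseudocode.
RULES = [
    (re.compile(r"\nsection (\d)\n\n"), lambda m: "SECTION " + m.group(1)),
    (re.compile(r"some line\n"), lambda m: "SOME LINE"),
]


def tokenize(text):
    """Return a list of (label, matched_text) pairs, or raise ValueError on garbage."""
    tokens = []
    pos = 0
    while pos < len(text):
        for pattern, make_label in RULES:
            # match() with a pos argument anchors the attempt at that offset.
            m = pattern.match(text, pos)
            # Ignore empty matches so the loop always makes progress.
            if m and m.end() > pos:
                tokens.append((make_label(m), m.group(0)))
                pos = m.end()
                break
        else:
            # No rule matched at pos: leading, in-between, or trailing garbage.
            raise ValueError("Found garbage: {!r}".format(text[pos:]))
    return tokens
```

Calling tokenize("\nsection 1\n\nsome line\n") should give the two (label, text) pairs, and any input that no rule matches at the current position raises ValueError.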

2013-06-20 Tim Landscheidt

There is an undocumented tool in the re module, re.Scanner, which may be helpful here. You could use it like this:

import re
import sys

def section(scanner, token):
    # scanner.match is the match object for the token just recognized.
    return "SECTION", scanner.match.group(1)

def some_line(scanner, token):
    return "SOME LINE", token

def garbage(scanner, token):
    sys.exit('Found garbage: {}'.format(token))

# scanner will attempt to match these patterns in the order listed.
# If there is a match, the second argument is called.
scanner = re.Scanner([
    (r"section (\d+)$", section),
    (r"some line$", some_line),
    (r"\s+", None),    # skip whitespace
    (r".+", garbage),  # if you get here it's garbage
    ], flags=re.MULTILINE)


tokens, remainder = scanner.scan('''\

section 1

some line
''')
for token in tokens:
    print(token)

which yields

('SECTION', '1') 
('SOME LINE', 'some line')
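If you would rather get an exception than have the process exit, one variation is to drop the catch-all rule and check the remainder that scan() returns. This is only a sketch: re.Scanner is undocumented, so its behavior may differ between Python versions, and GarbageError and tokenize are hypothetical names. It also avoids capture groups inside the lexicon and pulls the number out of the matched token text instead, since I am not sure how reliably group numbering works once the patterns are combined.

```python
import re


class GarbageError(ValueError):
    """Hypothetical exception for input that is not a pure token sequence."""


def section(scanner, token):
    # token is the full matched text, e.g. "section 1"; keep the number part.
    return "SECTION", token.split()[1]


def some_line(scanner, token):
    return "SOME LINE", token


# No catch-all rule: scanning stops at the first position where nothing matches.
scanner = re.Scanner([
    (r"section \d+$", section),
    (r"some line$", some_line),
    (r"\s+", None),  # skip whitespace
], flags=re.MULTILINE)


def tokenize(text):
    """Return the token list, or raise GarbageError if anything is left unscanned."""
    tokens, remainder = scanner.scan(text)
    if remainder:
        raise GarbageError("Found garbage: {!r}".format(remainder))
    return tokens
```

Because scanning stops at the first unmatched position, a non-empty remainder covers leading, in-between, and trailing garbage alike.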

2013-06-20 13:03:51 unutbu
