2013-06-20 30 views

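I want to parse a file into a list of tokens. Each token spans at least one line but may span more. Each token matches a regular expression. If the input is not a sequence of tokens (that is, if there is any leading, intervening, or trailing garbage), I want to signal an error. I am not concerned about memory consumption, because the input files are relatively small. How can I tokenize a file as a sequence of regular expressions in Python?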

In Perl, I would use something like this (pseudocode):

$s = slurp_file();
while ($s ne '') {
    if ($s =~ s/^\nsection (\d)\n\n//p) {
        push (@r, ['SECTION ' . $1, ${^MATCH}]);
    } elsif ($s =~ s/^some line\n//p) {
        push (@r, ['SOME LINE', ${^MATCH}]);
    [...]
    } else {
        die ("Found garbage: " . Dumper ($s));
    }
}

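I could of course port this 1:1 to Python, but is there a more Pythonic way to do it? (I do not want to parse line by line and then build a hand-crafted state machine on top.)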

Answer


There is an undocumented tool in the re module, re.Scanner, that may help here. You could use it like this:

import re 
import sys 

def section(scanner, token): 
    return "SECTION", scanner.match.group(1) 

def some_line(scanner, token): 
    return "SOME LINE", token 

def garbage(scanner, token): 
    sys.exit('Found garbage: {}'.format(token)) 

# scanner will attempt to match these patterns in the order listed. 
# If there is a match, the second argument is called. 
scanner = re.Scanner([ 
    (r"section (\d+)$", section),
    (r"some line$", some_line), 
    (r"\s+", None), # skip whitespace 
    (r".+", garbage), # if you get here it's garbage 
    ], flags=re.MULTILINE) 


tokens, remainder = scanner.scan('''\

section 1

some line
''')
for token in tokens: 
    print(token) 

which produces

('SECTION', '1') 
('SOME LINE', 'some line') 
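
Because re.Scanner is undocumented, a version that relies only on documented parts of the re module may be preferable. The following is a minimal sketch, not a definitive implementation: it joins the token patterns into one alternation of named groups and walks the input with Pattern.match(text, pos), raising an error as soon as nothing matches. The names TOKEN_SPEC, TOKEN_RE, tokenize and the group name number are assumptions made for illustration, and the two patterns are modelled on the Perl pseudocode in the question.

```python
import re

# Hypothetical token table modelled on the question's Perl sketch.
# Each entry is (token name, regular expression); order matters because
# the alternation tries the patterns from left to right.
TOKEN_SPEC = [
    ("SECTION", r"\nsection (?P<number>\d)\n\n"),
    ("SOME_LINE", r"some line\n"),
]

# One compiled pattern with a named group per token type.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return a list of (kind, matched_text) pairs.

    Raises ValueError if any part of the input is not covered by a token.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        # match() anchors at pos, so no input can be skipped silently.
        m = TOKEN_RE.match(text, pos)
        if m is None or m.end() == pos:
            raise ValueError(
                f"Found garbage at offset {pos}: {text[pos:pos + 40]!r}"
            )
        kind = m.lastgroup  # name of the outermost group that matched
        if kind == "SECTION":
            kind = f"SECTION {m.group('number')}"
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens
```

Calling tokenize("\nsection 1\n\nsome line\n") returns one pair per token, and any input that is not an exact sequence of tokens raises ValueError instead of exiting the process, which leaves the error handling to the caller.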