Using python (pyparsing) to parse lexc

This is my first attempt at using pyparsing, and I'm having a hard time setting it up. I want to use pyparsing to parse lexc files. The lexc format is used to declare lexicons that are compiled into finite-state transducers.

Special characters:

: divides 'upper' and 'lower' sides of a 'data' declaration 
; terminates entry 
# reserved LEXICON name. end-of-word or final state 
' ' (space) universal delimiter 
! introduces comment to the end of the line 
< introduces xfst-style regex 
> closes xfst-style regex 
% escape character: %: %; %# % %! %< %> %% 
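
For reference, the %-escape rule on its own can be sketched in a couple of lines of Python (an illustration only, not part of the lexc format itself):

import re

def unescape(s):
    # '%X' stands for the literal character X, e.g. '% ' is an escaped space
    return re.sub(r'%(.)', r'\1', s)

print(unescape('sour% cream'))  # -> 'sour cream'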

There are multiple levels at which to parse.

Generally speaking, anything from an unescaped ! to the end of the line is a comment. This can be handled separately at each level.

At the document level, there are three distinct sections:

Multichar_Symbols Optional one-time declaration 
LEXICON    Usually many of these 
END     Anything after this is ignored 

At the Multichar_Symbols level, anything separated by whitespace is a declaration. This section is terminated by the first LEXICON declaration.

Multichar_Symbols the+first-one thesecond_one 
third_one ! comment that this one is special 
+Pl  ! plural 

At the LEXICON level, the LEXICON's name is declared as:

LEXICON the_name ! whitespace delimited 

After the name declaration, a lexicon entry consists of: data continuation ;. The semicolon terminates the entry. data is optional.

At the data level, there are three possible forms:

  1. upper:lower

  2. simple (which gets exploded into upper and lower as simple:simple)

  3. <xfst-style regex>

Examples:

! # is a reserved continuation that means "end of word". 
dog+Pl:dogs # ; ! upper:lower continuation ; 
cat # ;   ! automatically exploded to "cat:cat # ;" by interpreter 
Num ;   ! no data, only a continuation to LEXICON named "Num" 
<[1|2|3]+> # ; ! xfst-style regex enclosed in <> 

Everything after END is ignored.

A complete lexc file might look like this:

! Comments begin with ! 

! Multichar_Symbols (separated by whitespace, terminated by first declared LEXICON) 
Multichar_Symbols +A +N +V ! +A is adjectives, +N is nouns, +V is verbs 
+Adv ! This one is for adverbs 
+Punc ! punctuation 
! +Cmpar ! This is broken for now, so I commented it out. 

! The bulk of lexc is made of up LEXICONs, which contain entries that point to 
! other LEXICONs. "Root" is a reserved lexicon name, and the start state. 
! "#" is also a reserved lexicon name, and the end state. 

LEXICON Root ! Root is a reserved lexicon name, if it is not declared, then the first LEXICON is assumed to be the root 
big Adj ; ! This 
bigly Adv ; ! Not sure if this is a real word... 
dog Noun ; 
cat Noun ; 
crow Noun ; 
crow Verb ; 
Num ;  ! This continuation class generates numbers using xfst-style regex 

! NB all the following are reserved characters 

sour% cream Noun ; ! escaped space 
%: Punctuation ; ! escaped : 
%; Punctuation ; ! escaped ; 
%# Punctuation ; ! escaped # 
%! Punctuation ; ! escaped ! 
%% Punctuation ; ! escaped % 
%< Punctuation ; ! escaped < 
%> Punctuation ; ! escaped > 

%:%:%::%: # ; ! Should map ::: to : 

LEXICON Adj 
+A: # ;  ! # is a reserved lexicon name which means end-of-word (final state). 
! +Cmpar:er # ; ! Broken, so I commented it out. 

LEXICON Adv 
+Adv: # ; 

LEXICON Noun 
+N+Sg: # ; 
+N+Pl:s # ; 

LEXICON Num 
<[0|1|2|3|4|5|6|7|8|9]> Num ; ! This is an xfst regular expression and a cyclic continuation 
# ; ! After the first cycle, this makes sense, but as it is, this is bad. 

LEXICON Verb 
+V+Inf: # ; 
+V+Pres:s # ; 

LEXICON Punctuation 
+Punc: # ; 

END 

This text is ignored because it is after END 

So there are multiple different levels at which to parse. What is the best way to set this up in pyparsing? Are there any examples of hierarchical languages like this that I could follow as a model?

Answer


The strategy for working with pyparsing is to break the parsing problem down into small pieces, and then combine them into larger ones.

Start with your first high-level structure definition:

Multichar_Symbols Optional one-time declaration 
LEXICON    Usually many of these 
END     Anything after this is ignored 

Your eventual overall parser will look like:

parser = (Optional(multichar_symbols_section)('multichar_symbols') 
      + Group(OneOrMore(lexicon_section))('lexicons') 
      + END) 

The names in parentheses after each part give us labels that make it easy to access the different parts of the parse results.
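
For example, results names work like this (a generic pyparsing illustration, not specific to lexc):

import pyparsing as pp

entry = pp.Word(pp.alphas)('name') + pp.Word(pp.nums)('value')
result = entry.parseString('answer 42')
print(result.name)   # -> 'answer'
print(result.value)  # -> '42'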

Digging down into the details, let's look at how to define the parser for lexicon_section.

First define the punctuation and the special keywords:

COLON,SEMI = map(Suppress, ":;") 
HASH = Literal('#') 
LEXICON, END = map(Keyword, "LEXICON END".split()) 

Your identifiers and values can contain '%'-escaped characters, so we need to build them up from pieces:

# use regex and Combine to handle % escapes 
escaped_char = Regex(r'%.').setParseAction(lambda t: t[0][1]) 
ident_lit_part = Word(printables, excludeChars=':%;') 
xfst_regex = Regex(r'<.*?>') 
ident = Combine(OneOrMore(escaped_char | ident_lit_part | xfst_regex)) 
value_expr = ident() 

With these pieces, we can now define an individual lexicon declaration:

# handle the following lexicon declarations: 
# name ; 
# name:value ; 
# name value ; 
# name value # ; 
lexicon_decl = Group(ident("name") 
        + Optional(Optional(COLON) 
           + value_expr("value") 
           + Optional(HASH)('hash')) 
        + SEMI) 

This part gets a little messy: it turns out that value can come back as a string, as a structured result (a pyparsing ParseResults), or it can even be missing entirely. We can use a parse action to normalize all of these forms to a single string form.

# use a parse action to normalize the parsed values 
def fixup_value(tokens): 
    if 'value' in tokens[0]: 
     # pyparsing makes this a nested element, just take zero'th value 
     if isinstance(tokens[0].value, ParseResults): 
      tokens[0]['value'] = tokens[0].value[0] 
    else: 
     # no value was given, expand 'name' as if parsed 'name:name' 
     tokens[0]['value'] = tokens[0].name 
lexicon_decl.setParseAction(fixup_value) 

Now the value is cleaned up at parse time, so no extra code is needed after calling parseString.
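
As a quick check (assuming the definitions above), an entry with no data now comes back with its value copied from its name:

print(lexicon_decl.parseString('Num ;')[0].dump())
# prints something like:
# ['Num']
# - name: 'Num'
# - value: 'Num'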

We are finally ready to define the whole LEXICON section:

# TBD - make name optional, define as 'Root' 
lexicon_section = Group(LEXICON 
         + ident("name") 
         + ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations")) 

One last bit of housekeeping: we need to ignore comments. We can call ignore on the topmost parser expression, and comments will be ignored throughout the entire parser:

# ignore comments anywhere in our parser 
comment = '!' + Optional(restOfLine) 
parser.ignore(comment) 

Here is all of the code in a single copy-pasteable section:

import pyparsing as pp 

# define punctuation and special words 
COLON,SEMI = map(pp.Suppress, ":;") 
HASH = pp.Literal('#') 
LEXICON, END = map(pp.Keyword, "LEXICON END".split()) 

# use regex and Combine to handle % escapes 
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1]) 
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;') 
xfst_regex = pp.Regex(r'<.*?>') 
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part | xfst_regex)) 
value_expr = ident() 


# handle the following lexicon declarations: 
# name ; 
# name:value ; 
# name value ; 
# name value # ; 
lexicon_decl = pp.Group(ident("name") 
        + pp.Optional(pp.Optional(COLON) 
           + value_expr("value") 
           + pp.Optional(HASH)('hash')) 
        + SEMI) 

# use a parse action to normalize the parsed values 
def fixup_value(tokens): 
    if 'value' in tokens[0]: 
     # pyparsing makes this a nested element, just take zero'th value 
     if isinstance(tokens[0].value, pp.ParseResults): 
      tokens[0]['value'] = tokens[0].value[0] 
    else: 
     # no value was given, expand 'name' as if parsed 'name:name' 
     tokens[0]['value'] = tokens[0].name 
lexicon_decl.setParseAction(fixup_value) 

# define a whole LEXICON section 
# TBD - make name optional, define as 'Root' 
lexicon_section = pp.Group(LEXICON 
         + ident("name") 
         + pp.ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations")) 

# this part still TBD - just put in a placeholder for now 
multichar_symbols_section = pp.empty() 
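# One way this placeholder might eventually be filled in (a sketch, not
# from the original answer): collect whitespace-delimited symbols up to
# the first LEXICON keyword, e.g.
#   MULTICHAR = pp.Keyword('Multichar_Symbols')
#   multichar_symbols_section = MULTICHAR + pp.OneOrMore(ident, stopOn=LEXICON)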

# tie it all together 
parser = (pp.Optional(multichar_symbols_section)('multichar_symbols') 
      + pp.Group(pp.OneOrMore(lexicon_section))('lexicons') 
      + END) 

# ignore comments anywhere in our parser 
comment = '!' + pp.Optional(pp.restOfLine) 
parser.ignore(comment) 
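
The snippets below assume that lexicon_sample is a string holding the LEXICON Root section from the question (reproduced here without its inline comments):

lexicon_sample = """\
LEXICON Root
big Adj ;
bigly Adv ;
dog Noun ;
cat Noun ;
crow Noun ;
crow Verb ;
Num ;
sour% cream Noun ;
%: Punctuation ;
%; Punctuation ;
%# Punctuation ;
%! Punctuation ;
%% Punctuation ;
%< Punctuation ;
%> Punctuation ;
%:%:%::%: # ;
"""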

Parsing the 'Root' sample you posted, we can use dump() to show the results:

result = lexicon_section.parseString(lexicon_sample)[0] 
print(result.dump()) 

giving this dump output:

['LEXICON', 'Root', ['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']] 
- declarations: [['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']] 
    [0]: 
    ['big', 'Adj'] 
    - name: 'big' 
    - value: 'Adj' 
    [1]: 
    ['bigly', 'Adv'] 
    - name: 'bigly' 
    - value: 'Adv' 
    [2]: 
    ['dog', 'Noun'] 
    - name: 'dog' 
    - value: 'Noun' 
    ... 
    [13]: 
    ['<', 'Punctuation'] 
    - name: '<' 
    - value: 'Punctuation' 
    [14]: 
    ['>', 'Punctuation'] 
    - name: '>' 
    - value: 'Punctuation' 
    [15]: 
    [':::', ':', '#'] 
    - hash: '#' 
    - name: ':::' 
    - value: ':' 
- name: 'Root' 

This code demonstrates how to iterate over the parts of a section and access its named fields:

# try out a lexicon against the posted sample 
result = lexicon_section.parseString(lexicon_sample)[0] 
print(result.dump()) 

print('Name:', result.name) 
print('\nDeclarations') 
for decl in result.declarations: 
    print("{name} -> {value}".format_map(decl), "(END)" if decl.hash else '') 

which gives:

Name: Root 

Declarations 
big -> Adj 
bigly -> Adv 
dog -> Noun 
cat -> Noun 
crow -> Noun 
crow -> Verb 
Num -> Num 
sour cream -> Noun 
: -> Punctuation 
; -> Punctuation 
# -> Punctuation 
! -> Punctuation 
% -> Punctuation 
< -> Punctuation 
> -> Punctuation 
::: -> : (END) 

Hopefully this gives you enough to take it from here.


Wow! This is a much more thorough answer than I expected! I'll have more time to look at it on Monday. Thanks! – reynoldsnlp


I don't understand what 'value_expr = ident()' is doing. What is the difference between 'ident' and 'value_expr'? They both seem to be objects of the same type. – reynoldsnlp


That's a fine distinction, and 'value_expr = ident' would also work. The difference is that 'ident()' returns a copy of 'ident' (shorthand for 'value_expr = ident.copy()'), so if you want to attach a parse action or other behavior to the expression used as the value (a right-hand-side use of it), you can safely do so on 'value_expr' without affecting 'ident'. – PaulMcG
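
A small sketch of that difference (illustrative only):

import pyparsing as pp

ident = pp.Word(pp.alphas)
value_expr = ident()   # same as ident.copy()

# attaching a parse action to the copy leaves the original untouched
value_expr.setParseAction(lambda t: t[0].upper())

print(ident.parseString('abc'))       # -> ['abc']
print(value_expr.parseString('abc'))  # -> ['ABC']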