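Using Python (pyparsing) to parse lexc
This is my first attempt at using pyparsing, and I'm having a hard time setting it up. I want to use pyparsing to parse lexc files. The lexc format is used to declare a lexicon that is compiled into finite-state transducers.
Special characters: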
: divides 'upper' and 'lower' sides of a 'data' declaration
; terminates entry
# reserved LEXICON name. end-of-word or final state
' ' (space) universal delimiter
! introduces comment to the end of the line
< introduces xfst-style regex
> closes xfst-style regex
% escape character: %: %; %# % %! %< %> %%
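There are multiple levels to parse.
Universally, anything from an unescaped ! to the end of the line is a comment. This can be handled separately at each level.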
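One way the comment rule could be handled in pyparsing is sketched below, assuming pyparsing 3.x. The idea is that escapes such as %! are consumed inside a token, so any ! seen between tokens must be unescaped and can be handed to ignore(). The token regex and the names token, comment, and tokens are illustrative assumptions, not part of lexc or of any existing parser.

```python
import pyparsing as pp

# A token is a run of escaped characters (% followed by any character) or
# ordinary characters that are neither whitespace nor lexc special characters.
token = pp.Regex(r"(?:%.|[^\s%!:;#<>])+")

# "%!" is swallowed inside a token, so a "!" found between tokens is unescaped.
comment = pp.Literal("!") + pp.rest_of_line

tokens = pp.ZeroOrMore(token)
tokens.ignore(comment)  # skip comments wherever a token could start

found = tokens.parse_string("dog %! cat ! a comment").as_list()
print(found)  # ['dog', '%!', 'cat']
```

Because ignore() propagates to sub-expressions, the same comment expression can be attached once to the outermost expression instead of being repeated at every level.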
At the document level, there are three different sections:
Multichar_Symbols Optional one-time declaration
LEXICON Usually many of these
END Anything after this is ignored
At the Multichar_Symbols level, anything separated by whitespace is a declaration. This section ends at the first declaration of a LEXICON.
Multichar_Symbols the+first-one thesecond_one
third_one ! comment that this one is special
+Pl ! plural
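A minimal sketch of the Multichar_Symbols level, assuming pyparsing 3.x: symbols are collected until a negative lookahead sees the first LEXICON keyword. The symbol regex and the names symbol, multichar_section, and the "symbols" results name are assumptions made for illustration.

```python
import pyparsing as pp

# A symbol is a run of escaped (%x) or ordinary, non-special characters.
symbol = pp.Regex(r"(?:%.|[^\s%!:;#<>])+")
comment = pp.Literal("!") + pp.rest_of_line
LEXICON = pp.Keyword("LEXICON")

# Collect symbols until the lookahead ~LEXICON sees the first LEXICON keyword.
multichar_section = pp.Suppress(pp.Keyword("Multichar_Symbols")) + pp.Group(
    pp.ZeroOrMore(~LEXICON + symbol)
)("symbols")
multichar_section.ignore(comment)

sample = """\
Multichar_Symbols the+first-one thesecond_one
third_one ! comment that this one is special
+Pl ! plural
LEXICON Root
"""
symbols = multichar_section.parse_string(sample)["symbols"].as_list()
print(symbols)  # ['the+first-one', 'thesecond_one', 'third_one', '+Pl']
```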
At the LEXICON level, the LEXICON's name is declared as:
LEXICON the_name ! whitespace delimited
After the name declaration, a LEXICON is composed of entries of the form: data continuation ; . The semicolon delimits entries. data is optional.
At the data level, there are three possible forms: upper:lower, simple (which is exploded to upper and lower as simple:simple), and <xfst-style regex>.
Examples:
! # is a reserved continuation that means "end of word".
dog+Pl:dogs # ; ! upper:lower continuation ;
cat # ; ! automatically exploded to "cat:cat # ;" by interpreter
Num ; ! no data, only a continuation to LEXICON named "Num"
<[1|2|3]+> # ; ! xfst-style regex enclosed in <>
Everything after END is ignored.
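The entry and data levels could be sketched as follows, assuming pyparsing 3.x. The three data forms are written as regexes with named groups, which pyparsing exposes as named results, and the entry tries "data continuation ;" before falling back to "continuation ;". The ATOM pattern and all expression names here are illustrative assumptions, and the regex form only handles a > inside <...> when it is escaped.

```python
import pyparsing as pp

ATOM = r"(?:%.|[^\s%!:;#<>])"  # one escaped (%x) or ordinary character
comment = pp.Literal("!") + pp.rest_of_line

# The three data forms; named regex groups become named results.
xfst = pp.Regex(r"<(?P<regex>(?:%.|[^%>])*)>")
pair = pp.Regex(rf"(?P<upper>{ATOM}*):(?P<lower>{ATOM}*)")
simple = pp.Regex(rf"(?P<upper>{ATOM}+)")
data = xfst | pair | simple

continuation = pp.Regex(rf"#|{ATOM}+")("continuation")

# Try "data continuation ;" first, then fall back to "continuation ;",
# because a lone continuation such as "Num" also looks like simple data.
entry = pp.Group(((data + continuation) | continuation) + pp.Suppress(";"))
entries = pp.ZeroOrMore(entry)
entries.ignore(comment)

sample = """\
! # is a reserved continuation that means "end of word".
dog+Pl:dogs # ; ! upper:lower continuation ;
cat # ; ! automatically exploded to "cat:cat # ;" by interpreter
Num ; ! no data, only a continuation to LEXICON named "Num"
<[1|2|3]+> # ; ! xfst-style regex enclosed in <>
"""

parsed = []
for e in entries.parse_string(sample, parse_all=True):
    upper = e.get("upper")
    lower = e.get("lower", upper)  # simple data is exploded to simple:simple
    parsed.append((upper, lower, e.get("regex"), e["continuation"]))
    print(parsed[-1])
```

The explicit alternation is needed because pyparsing does not backtrack into an Optional: with Optional(data) + continuation, the text "Num ;" would consume "Num" as data and then fail to find a continuation.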
A complete lexc file might look like this:
! Comments begin with !
! Multichar_Symbols (separated by whitespace, terminated by first declared LEXICON)
Multichar_Symbols +A +N +V ! +A is adjectives, +N is nouns, +V is verbs
+Adv ! This one is for adverbs
+Punc ! punctuation
! +Cmpar ! This is broken for now, so I commented it out.
! The bulk of lexc is made of up LEXICONs, which contain entries that point to
! other LEXICONs. "Root" is a reserved lexicon name, and the start state.
! "#" is also a reserved lexicon name, and the end state.
LEXICON Root ! Root is a reserved lexicon name, if it is not declared, then the first LEXICON is assumed to be the root
big Adj ; ! This
bigly Adv ; ! Not sure if this is a real word...
dog Noun ;
cat Noun ;
crow Noun ;
crow Verb ;
Num ; ! This continuation class generates numbers using xfst-style regex
! NB all the following are reserved characters
sour% cream Noun ; ! escaped space
%: Punctuation ; ! escaped :
%; Punctuation ; ! escaped ;
%# Punctuation ; ! escaped #
%! Punctuation ; ! escaped !
%% Punctuation ; ! escaped %
%< Punctuation ; ! escaped <
%> Punctuation ; ! escaped >
%:%:%::%: # ; ! Should map ::: to :
LEXICON Adj
+A: # ; ! # is a reserved lexicon name which means end-of-word (final state).
! +Cmpar:er # ; ! Broken, so I commented it out.
LEXICON Adv
+Adv: # ;
LEXICON Noun
+N+Sg: # ;
+N+Pl:s # ;
LEXICON Num
<[0|1|2|3|4|5|6|7|8|9]> Num ; ! This is an xfst regular expression and a cyclic continuation
# ; ! After the first cycle, this makes sense, but as it is, this is bad.
LEXICON Verb
+V+Inf: # ;
+V+Pres:s # ;
LEXICON Punctuation
+Punc: # ;
END
This text is ignored because it is after END
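For a file shaped like the one above, the levels could be layered in pyparsing roughly as follows. This is only a sketch under assumptions: it targets pyparsing 3.x, it assumes END is always present, it treats a bare LEXICON or END at the start of an entry as a section boundary, and every name in it (ATOM, document, the results names) is made up for illustration. The shortened sample stands in for the full file.

```python
import pyparsing as pp

ATOM = r"(?:%.|[^\s%!:;#<>])"  # one escaped (%x) or ordinary character
comment = pp.Literal("!") + pp.rest_of_line
MULTICHAR = pp.Keyword("Multichar_Symbols")
LEXICON = pp.Keyword("LEXICON")
END = pp.Keyword("END")
name = pp.Regex(rf"{ATOM}+")

# Data level: <regex>, upper:lower, or simple.
xfst = pp.Regex(r"<(?P<regex>(?:%.|[^%>])*)>")
pair = pp.Regex(rf"(?P<upper>{ATOM}*):(?P<lower>{ATOM}*)")
simple = pp.Regex(rf"(?P<upper>{ATOM}+)")
data = xfst | pair | simple

# Entry level: [data] continuation ;
continuation = pp.Regex(rf"#|{ATOM}+")("continuation")
entry = pp.Group(((data + continuation) | continuation) + pp.Suppress(";"))

# LEXICON level: a name, then entries up to the next LEXICON or END.
lexicon = pp.Group(
    pp.Suppress(LEXICON)
    + name("name")
    + pp.Group(pp.ZeroOrMore(~(LEXICON | END) + entry))("entries")
)

# Document level: optional Multichar_Symbols, many LEXICONs, then END.
multichar = pp.Suppress(MULTICHAR) + pp.Group(
    pp.ZeroOrMore(~LEXICON + name)
)("multichar_symbols")
document = (
    pp.Optional(multichar)
    + pp.Group(pp.OneOrMore(lexicon))("lexicons")
    + pp.Suppress(END)
)
document.ignore(comment)  # comments are skipped at every level

sample = """\
! Comments begin with !
Multichar_Symbols +N +Pl ! tags
+Punc
LEXICON Root
dog Noun ;
sour% cream Noun ; ! escaped space
%! Punctuation ; ! escaped !
Num ;
LEXICON Noun
+N: # ;
+N+Pl:s # ;
LEXICON Num
<[0|1|2]> Num ;
# ;
LEXICON Punctuation
+Punc: # ;
END
This text is ignored because it is after END
"""

# parse_all is left off on purpose, so text after END is never examined.
result = document.parse_string(sample)
for lex in result["lexicons"]:
    print(lex["name"], [e.as_list() for e in lex["entries"]])
```

Each level is an ordinary expression built from the one below it, so a level can be tested on its own before the whole document grammar is assembled.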
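So there are multiple different levels at which to parse. What is the best way to set this up in pyparsing? Are there any examples of this kind of hierarchical language that I could follow as a model?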
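Wow! This is a more thorough answer than I expected! I'll have more time to look at it on Monday. Thanks! – reynoldsnlp
I don't understand what 'value_expr = ident()' is doing. What is the difference between 'ident' and 'value_expr'? They both seem to be the same type of object. – reynoldsnlp
It is a fine distinction, and 'value_expr = ident' would work just as well. The difference is that 'ident()' returns a copy of 'ident' (shorthand for 'value_expr = ident.copy()'), so if you want to attach a parse action or other feature to the ident expression used as a right-hand-side value, you can safely do it on 'value_expr', and 'ident' won't be affected. – PaulMcG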
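The copy behavior described in the last comment can be checked with a small sketch, assuming pyparsing 3.x. The definition of ident here is a stand-in, since the original answer's definition is not shown.

```python
import pyparsing as pp

# Hypothetical identifier expression standing in for the answer's 'ident'.
ident = pp.Word(pp.alphas, pp.alphanums + "_")

alias = ident         # plain assignment: the very same object
value_expr = ident()  # calling with no name returns a copy, like ident.copy()

# A parse action attached to the copy does not leak back to 'ident'.
value_expr.add_parse_action(lambda toks: toks[0].upper())

print(alias is ident)                     # True
print(value_expr is ident)                # False
print(ident.parse_string("abc")[0])       # abc
print(value_expr.parse_string("abc")[0])  # ABC
```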