Getting a grammar to read multiple keywords in text

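I still consider myself a newbie to pyparsing. I threw together two quick grammars, and neither does what I am trying to do. The grammar I am after seems very simple, yet it turns out (at least for me) to be not so trivial. The language has one basic definition: it is broken down into keywords and body text. A body can span multiple lines. Keywords are found at the beginning of a line, within the first 20 characters or so, and are terminated with a ';' (without the quotes). So I put together a quick demo program so that I could test a couple of grammars. However, when I try them, they always get the first keyword but none after that.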

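I've attached the source code as an example, along with the output I am getting. Even though this is just test code, I documented it out of habit. In the example below, the two keywords are NOW; and lastly;. Ideally, I would not want the semicolon included in the keyword.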

Any ideas what I should do to make this work?

from pyparsing import * 

def testString(text,grammar): 
    """ 
    @summary: perform a test of a grammar 
    @type text: text 
    @param text: text buffer for input (a message to be parsed) 
    @type grammar: MatchFirst or equivalent pyparsing construct 
    @param grammar: some grammar defined somewhere else 
    @type pgm: text 
    @param pgm: typically name of the program, which invoked this function. 
    @status: 20130802 CODED 
    """ 
    print 'Input Text is %s' % text 
    print 'Grammar is %s' % grammar 
    tokens = grammar.parseString(text) 
    print 'After parse string: %s' % tokens 
    tokens.dump() 
    tokens.keys() 

    return tokens 


def getText(msgIndex): 
    """ 
    @summary: make a text string suitable for parsing 
    @returns: returns a text buffer 
    @type msgIndex: int 
    @param msgIndex: a number corresponding to a text buffer to retrieve 
    @status: 20130802 CODED 
    """ 

    msg = [ """NOW; is the time for a few good ones to come to the aid 
of new things to come for it is almost time for 
a tornado to strike upon a small hill 
when least expected. 
lastly; another day progresses and 
then we find that which we seek 
and finally we will 
find our happiness perhaps its closer than 1 or 2 years or not so 
    """, 
     '', 
     ] 

    return msg[msgIndex] 

def getGrammar(grammarIndex): 
    """ 
    @summary: make a grammar given an index 
    @type: grammarIndex: int 
    @param grammarIndex: a number corresponding to the grammar to be retrieved 
    @Note: a good run will return 2 keys: NOW; and lastly; and each key will have an associated body. The body is all 
    words and text up to the next keyword or eof, whichever is first. 
    """ 
    kw = Combine(Word(alphas + nums) + Literal(';'))('KEY') 
    kw.setDebug(True) 
    body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body') 
    body1.setDebug(True) 
    g1 = OneOrMore(Group(kw + body1)) 

    # ok start defining a new grammar (borrow kw from grammar). 

    body2 = SkipTo(~kw, include=False)('BODY') 
    body2.setDebug(True) 

    g2 = OneOrMore(Group(kw+body2)) 
    grammar = [g1, 
      g2, 
      ] 
    return grammar[grammarIndex] 


if __name__ == '__main__': 
    # list indices [ text, grammar ] 
    tests = {1: [0,0], 
     2: [0,1], 
     } 
    check = tests.keys() 
    check.sort() 
    for testno in check: 
        print 'STARTING Test %d' % testno 
        text = getText(tests[testno][0]) 
        grammar = getGrammar(tests[testno][1]) 
        tokens = testString(text, grammar) 
        print 'Tokens found %s' % tokens 
        print 'ENDING Test %d' % testno

The output looks like this (using Python 2.7 and pyparsing 2.0.1):

STARTING Test 1 
    Input Text is NOW; is the time for a few good ones to come to the aid 
    of new things to come for it is almost time for 
    a tornado to strike upon a small hill 
    when least expected. 
    lastly; another day progresses and 
    then we find that which we seek 
    and finally we will 
    find our happiness perhaps its closer than 1 or 2 years or not so 

    Grammar is {Group:({Combine:({W:(abcd...) ";"}) {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]...})}... 
    Match Combine:({W:(abcd...) ";"}) at loc 0(1,1) 
    Matched Combine:({W:(abcd...) ";"}) -> ['NOW;'] 
    Match {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... at loc 4(1,5) 
    Match Combine:({W:(abcd...) ";"}) at loc 161(4,20) 
    Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20) 
    Matched {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... -> ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected'] 
    Match Combine:({W:(abcd...) ";"}) at loc 161(4,20) 
    Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20) 
    After parse string: [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']] 
    Tokens found [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']] 
    ENDING Test 1 
    STARTING Test 2 
    Input Text is NOW; is the time for a few good ones to come to the aid 
    of new things to come for it is almost time for 
    a tornado to strike upon a small hill 
    when least expected. 
    lastly; another day progresses and 
    then we find that which we seek 
    and finally we will 
    find our happiness perhaps its closer than 1 or 2 years or not so 

    Grammar is {Group:({Combine:({W:(abcd...) ";"}) SkipTo:(~{Combine:({W:(abcd...) ";"})})})}... 
    Match Combine:({W:(abcd...) ";"}) at loc 0(1,1) 
    Matched Combine:({W:(abcd...) ";"}) -> ['NOW;'] 
    Match SkipTo:(~{Combine:({W:(abcd...) ";"})}) at loc 4(1,5) 
    Match Combine:({W:(abcd...) ";"}) at loc 4(1,5) 
    Exception raised:Expected ";" (at char 7), (line:1, col:8) 
    Matched SkipTo:(~{Combine:({W:(abcd...) ";"})}) -> [''] 
    Match Combine:({W:(abcd...) ";"}) at loc 5(1,6) 
    Exception raised:Expected ";" (at char 7), (line:1, col:8) 
    After parse string: [['NOW;', '']] 
    Tokens found [['NOW;', '']] 
    ENDING Test 2 

    Process finished with exit code 0

Source

2013-08-04 Div

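I'm a fan of TDD, but here your whole testing and alternative-selecting infrastructure really gets in the way of seeing where the grammar is and what is happening with it. If I strip away all the extra machinery, I see that your grammar is just: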

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY') 
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body') 
g1 = OneOrMore(Group(kw + body1))

The first issue I see is your definition of body1:

body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')

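You are on the right track with a negative lookahead, but for it to work in pyparsing you have to put it at the beginning of the expression, not the end. Think of it as "before I match another valid word, I will first rule out that it is a keyword":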

body1 = delimitedList(OneOrMore(~kw + Word(alphas + nums)))('Body')
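The effect of where the lookahead sits can be seen on a tiny input. The following is a minimal sketch, not the code from the thread: the sample string and the names `word`, `body_trailing`, and `body_leading` are made up for illustration, the `delimitedList` wrapper is left out, and it uses `print()` as a function rather than the Python 2 statements used elsewhere here.

```python
from pyparsing import Combine, Group, Literal, OneOrMore, Word, alphas, nums

# Keyword: a word immediately followed by ';' (same definition as in the question).
kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
word = Word(alphas + nums)

# Lookahead at the end: checked only once, after all the words were consumed.
body_trailing = (OneOrMore(word) + ~kw)('Body')
# Lookahead at the start: checked before every single word.
body_leading = OneOrMore(~kw + word)('Body')

# Hypothetical sample text without any punctuation in the bodies.
sample = "NOW; alpha beta lastly; gamma delta"

trailing = OneOrMore(Group(kw + body_trailing)).parseString(sample)
leading = OneOrMore(Group(kw + body_leading)).parseString(sample)

print(trailing.asList())                     # one group; 'lastly' is swallowed as a body word
print([group['KEY'] for group in leading])   # both keywords are found
```

With the trailing form, the repetition has already consumed the word part of the next keyword by the time the lookahead runs, so the check comes too late to stop it.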

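(Why is this a delimitedList, by the way? delimitedList is usually reserved for true lists with delimiters, such as comma-separated arguments to a program function. All it does here is accept any commas that might be mixed into the body, which would be handled more straightforwardly with a list of punctuation.)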
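For comparison, this is the kind of input delimitedList is normally used for. A small sketch with a made-up argument string; the name `args` is only for illustration:

```python
from pyparsing import Word, alphas, nums, delimitedList

# A true delimited list: items separated by commas, with the commas dropped from the result.
args = delimitedList(Word(alphas + nums))

parsed = args.parseString("x, y, z").asList()
print(parsed)  # ['x', 'y', 'z']
```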

Here is my test version of your code:

from pyparsing import * 

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY') 
body1 = OneOrMore(~kw + Word(alphas + nums))('Body') 
g1 = OneOrMore(Group(kw + body1)) 

msg = [ """NOW; is the time for a few good ones to come to the aid 
of new things to come for it is almost time for 
a tornado to strike upon a small hill 
when least expected. 
lastly; another day progresses and 
then we find that which we seek 
and finally we will 
find our happiness perhaps its closer than 1 or 2 years or not so 
    """, 
      '', 
      ][0] 

result = g1.parseString(msg) 
# we expect multiple groups, each containing "KEY" and "Body" names, 
# so iterate over groups, and dump the contents of each 
for res in result: 
    print res.dump()

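I still get the same result as you: only the first keyword matches. To see where the disconnect is happening, I used scanString, which returns not only the matching tokens but also the start and end locations of the match:

result,start,end = next(g1.scanString(msg)) 
print len(msg),end

This gives me: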

320 161

So the match ends at position 161 in a string whose total length is 320. I'll add one more print statement:

print msg[end:end+10]

and I get:

. 
lastly;

The trailing period in your body text is the culprit. If I remove it from the message and try parseString again, I now get:

['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected'] 
- Body: ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected'] 
- KEY: NOW; 
['lastly;', 'another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so'] 
- Body: ['another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so'] 
- KEY: lastly;
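The scanString diagnosis described above can be reproduced on a smaller input. This is a sketch under assumptions: the sample string is invented, and the `parseAll=True` check at the end is an alternative not used in this thread, added only to show another way of noticing that parsing stopped early.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, ParseException,
                       Word, alphas, nums)

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body = OneOrMore(~kw + Word(alphas + nums))('Body')
grammar = OneOrMore(Group(kw + body))

# Hypothetical sample containing a period that the grammar does not know about.
sample = "NOW; alpha beta. lastly; gamma"

# scanString yields (tokens, start, end) for each match it finds; take the first one.
tokens, start, end = next(grammar.scanString(sample))
print(len(sample), end)
print(repr(sample[end:end + 10]))  # the text right where the match stopped

# Alternative: require the whole input to be consumed and let pyparsing report the failure.
try:
    grammar.parseString(sample, parseAll=True)
    stopped_early = False
except ParseException as err:
    stopped_early = True
    print(err)
```

Printing the text at `end` points directly at the character the grammar could not handle, which is usually quicker than reading the debug trace.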

If you want to handle punctuation, I suggest you add something like:

PUNC = oneOf(". , ? ! : & $")

and add it to body1:

body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
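Putting the pieces together, a minimal sketch of the grammar with punctuation might look like the following. The sample string and the `make_grammar` helper are hypothetical. The second half goes beyond what the answer covers: it assumes that wrapping the ';' in `Suppress` is an acceptable way to meet the question's wish for keys without the semicolon.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, Suppress, Word,
                       alphas, nums, oneOf)

PUNC = oneOf(". , ? ! : & $")

def make_grammar(kw):
    """Build the keyword/body grammar around a given keyword expression."""
    body = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
    return OneOrMore(Group(kw + body))

# Hypothetical sample with punctuation inside the bodies.
sample = "NOW; alpha beta. lastly; gamma, delta"

# Keyword as defined in the thread: the ';' stays part of the key.
kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
groups = make_grammar(kw).parseString(sample)
for group in groups:
    print(group.dump())

# Assumption beyond the answer: Suppress still requires the ';' to be present,
# but drops it from the reported tokens.
kw_bare = Combine(Word(alphas + nums) + Suppress(';'))('KEY')
bare_keys = [group['KEY'] for group in make_grammar(kw_bare).parseString(sample)]
print(bare_keys)
```

Punctuation marks come back as separate tokens in each body, so they can be filtered out or rejoined afterwards as needed.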

Source

2013-08-04 14:28:58 PaulMcG

Thanks for the feedback. Hopefully I'll start getting the hang of this soon. – Div

Also, thanks for the tip about scanString. I intend to make use of it. – Div
