2017-08-17 110 views
2

I have wiki-like text such as the following. How can I keep the whitespace when using pyparsing's nestedExpr?

data = """ 
{{hello}} 

{{hello world}} 
{{hello much { }} 
{{a {{b}}}} 

{{a 

td { 

} 
{{inner}} 
}} 

"""

I want to extract the macros from it. A macro is the text enclosed between {{ and }}.

So I tried using nestedExpr:

from pyparsing import * 
import pprint 

def getMacroCandidates(txt): 

    candidates = [] 

    def nestedExpr(opener="(", closer=")", content=None, ignoreExpr=quotedString.copy()): 
     if opener == closer: 
      raise ValueError("opening and closing strings cannot be the same") 
     if content is None: 
      if isinstance(opener,str) and isinstance(closer,str): 
       if ignoreExpr is not None: 
        content = (Combine(OneOrMore(~ignoreExpr + 
            ~Literal(opener) + ~Literal(closer) + 
            CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS,exact=1)) 
           ).setParseAction(lambda t:t[0])) 
     ret = Forward() 
     ret <<= Group(opener + ZeroOrMore(ignoreExpr | ret | content) + closer) 

     ret.setName('nested %s%s expression' % (opener,closer)) 
     return ret 

    # macros are delimited by {{ and }}
    macro = nestedExpr("{{", "}}")
    # scan the text passed in, rather than the global data
    for toks, preloc, nextloc in macro.scanString(txt):
        print(toks)
    return candidates

data = """ 
{{hello}} 

{{hello world}} 
{{hello much { }} 
{{a {{b}}}} 

{{a 

td { 

} 
{{inner}} 
}} 
""" 

getMacroCandidates(data) 

This gives me the tokens, but with the whitespace stripped:

[['{{', 'hello', '}}']] 
[['{{', 'hello', 'world', '}}']] 
[['{{', 'hello', 'much', '{', '}}']] 
[['{{', 'a', ['{{', 'b', '}}'], '}}']] 
[['{{', 'a', 'td', '{', '}', ['{{', 'inner', '}}'], '}}']] 

Thanks in advance.

+0

To get the original text of a parsed expression, you can use the 'originalTextFor' helper: 'macro = originalTextFor(nestedExpr("{{", "}}"))'. This will preserve all whitespace, newlines, etc. – PaulMcG
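The suggestion in this comment can be sketched as follows. This is a minimal example, assuming pyparsing's built-in nestedExpr (rather than the local copy in the question) and a version of pyparsing that still exposes the camelCase names originalTextFor and nestedExpr. The variable names macros and inner are chosen here for illustration only.

```python
from pyparsing import nestedExpr, originalTextFor

data = """
{{hello}}

{{hello world}}
{{hello much { }}
{{a {{b}}}}

{{a

td {

}
{{inner}}
}}
"""

# originalTextFor returns the matched slice of the input string,
# so whitespace and newlines inside each macro are kept as written
macro = originalTextFor(nestedExpr("{{", "}}"))

# scanString yields (tokens, start, end); tokens[0] is the raw macro text
macros = [tokens[0] for tokens, start, end in macro.scanString(data)]

# drop the outer {{ and }} to keep only the enclosed text
inner = [text[2:-2] for text in macros]

for text in macros:
    print(repr(text))
```

Only top-level macros are reported, because scanning resumes after the end of each match. A nested macro such as {{b}} stays inside the text of its parent.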

Answer

0

You can use string replacement instead:

data = """ 
{{hello}} 

{{hello world}} 
{{hello much { }} 
{{a {{b}}}} 

{{a 

td { 

} 
{{inner}} 
}} 
""" 

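import shlex

# Turn every {{ and }} into a double quote so shlex reads each macro as one quoted token
data1 = data.replace("{{", '"')
data2 = data1.replace("}}", '"')
# Blank out any remaining single braces
data3 = data2.replace("}", " ")
data4 = data3.replace("{", " ")
# Collapse runs of whitespace, including newlines, into single spaces
data5 = ' '.join(data4.split())
print(shlex.split(data5.replace("\n", " ")))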

Output

This returns all the tokens with the braces removed. Extra line breaks and surplus whitespace are removed as well:

['hello', 'hello world', 'hello much ', 'a b', 'a td inner '] 

PS: This could be written as a single expression; it is split into several statements here for readability.
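As a rough sketch of the single-expression form mentioned in the PS, the replacements can be chained. The function name extract_macro_text and the sample string are hypothetical, introduced only for this example. The approach carries the same limitations as the code above: nesting is flattened, and the input is assumed to contain no quote characters, since shlex.split raises ValueError on an unbalanced quote.

```python
import shlex


def extract_macro_text(text):
    """Return the text of each {{...}} macro, with nesting flattened.

    Hypothetical helper: assumes the text contains no quote characters.
    """
    flattened = (
        text.replace("{{", '"')   # macro delimiters become double quotes
            .replace("}}", '"')
            .replace("}", " ")    # stray single braces become spaces
            .replace("{", " ")
    )
    # split/join collapses all whitespace, including newlines, to single spaces
    return shlex.split(" ".join(flattened.split()))


sample = "{{hello}}\n\n{{hello world}}\n{{a {{b}}}}\n"
print(extract_macro_text(sample))
# → ['hello', 'hello world', 'a b']
```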