2010-10-27 39 views
2

What is a quick way to split on brackets in Python?

I have a file in the following format:

ID1 { some text } 
ID2 { some text } 

The entries don't have to be laid out one per line, so we can also have:

ID1 { some [crlf] 
text [crlf] 
} 

ID2 [crlf] { some t [crlf] 
ex [crlf] 
t} 

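And so on. That is, some text can span more than one line, and a CRLF can come immediately after the ID. The main invariant is that every ID is followed by a { } block. The catch is that some text can itself contain { }.

What is a quick way to take such a file and split it into a list of strings, each of the form ID { text }, while taking nested brackets into account?

Some error checking, for example for unbalanced brackets, would be nice.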

+0

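Can you change the format of the file at all, I mean change how the data is written to it? The file looks like a mess! I think you should work out how to write the data so that retrieval is easy before working out how to retrieve it. – mouad 2010-10-27 01:21:07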

+0

Can "some text" contain a "crlf"? If not, strip them out and things become a lot easier... – 2010-10-27 01:27:21

Answers

2

Regex alone is out of the question, obviously. Have you looked at pyparsing?

[Edit]

OTOH, this might work:

from functools import wraps


def transition(method):
    # Run the handler, then switch the state's class to the class
    # selected by the command the handler returned.
    @wraps(method)
    def trans(state, *args, **kwargs):
        command = method(state, *args, **kwargs)
        state.__class__ = command(state)
    return trans


class State(object):
    def __new__(cls):
        state = object.__new__(cls)
        state._identities = []  # stack of state classes to return to
        return state

def unchanged(state):
    return state.__class__

def shifting(identity):
    def command(state):
        return identity
    return command

def pushing(identity, afterwards=None):
    def command(state):
        state._identities.append(afterwards or state.__class__)
        return identity
    return command

def popped(state):
    return state._identities.pop()


############################################################################## 


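import re
# \Z matches only at the very end of the input, so eoi is emitted exactly
# once. ($ combined with re.MULTILINE would also fire at every line end.)
tokenize = re.compile(flags=re.VERBOSE, pattern=r"""
    (?P<word>       \w+ ) |
    (?P<braceleft>  \{  ) |
    (?P<braceright> \}  ) |
    (?P<eoi>        \Z  ) |
    (?P<error>      \S  )  # catch all (except white space)
""").finditer

def parse(parser, source, builder):
    # Dispatch each token to the handler named after its token type.
    for each in tokenize(source):
        dispatch = getattr(parser, each.lastgroup)
        dispatch(each.group(), builder)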


class ParsingState(State):
    def eoi(self, token, *args):
        raise ValueError('premature end of input in parsing state %s' %
            self.__class__.__name__
        )
    def error(self, token, *args):
        raise ValueError('parsing state %s does not understand token %s' % (
            self.__class__.__name__, token
        ))
    def __getattr__(self, name):
        # Any token type without a handler in the current state is an error.
        def raiser(token, *args):
            raise ValueError(
                'parsing state %s does not understand token "%s" of type %s' %
                (self.__class__.__name__, token, name)
            )
        return raiser


class Id(ParsingState):
    @transition
    def word(self, token, builder):
        builder.add_id(token)
        return shifting(BeginContent)
    @transition
    def eoi(self, token, builder):
        return shifting(DoneParsing)

class BeginContent(ParsingState):
    @transition
    def braceleft(self, token, builder):
        return shifting(Content)

class Content(ParsingState):
    @transition
    def word(self, token, builder):
        builder.add_text(token)
        return unchanged
    @transition
    def braceleft(self, token, builder):
        builder.add_text(token)
        return pushing(PushedContent)
    @transition
    def braceright(self, token, builder):
        return shifting(Id)

class PushedContent(Content):
    @transition
    def braceright(self, token, builder):
        builder.add_text(token)
        return popped

class DoneParsing(ParsingState):
    pass

############################################################################## 


class Entry(object):
    def __init__(self, idname):
        self.idname = idname
        self.text = []
    def __str__(self):
        return '%s { %s }' % (self.idname, ' '.join(self.text))

class Builder(object):
    def __init__(self):
        self.entries = []
    def add_id(self, id_token):
        self.entries.append(Entry(id_token))
    def add_text(self, text_token):
        self.entries[-1].text.append(text_token)


############################################################################## 


if __name__ == '__main__':

    file_content = """
    id1 { some text } id2 {
    some { text }
    }
    """

    builder = Builder()
    parse(Id(), file_content, builder)
    for entry in builder.entries:
        print(entry)
2

This is a simple "how do I write a recursive descent parser that matches brackets" question.

Given this grammar:

STMT_LIST := STMT+ 
STMT := ID '{' DATA '}' 
DATA := TEXT | STMT 
ID := [a-z0-9]+ 
TEXT := [^}]* 

The parser might look like:

import sys
import re

def parse(data):
    """
    STMT_LIST: parse and print statements until the input is used up.
    """
    while data:
        data, statement_id, clause = parse_statement(data)
        print(repr((statement_id, clause)))

def consume_whitespace(data):
    return data.lstrip()

def parse_statement(data):
    m = re.match('[a-zA-Z0-9]+', data)
    if not m:
        raise ValueError("No ID found")
    statement_id = m.group(0)
    data = consume_whitespace(data[len(statement_id):])
    data, clause = parse_clause(data)
    return consume_whitespace(data), statement_id, clause

def parse_clause(data):
    clause = []
    if not data.startswith('{'):
        raise ValueError("No { found")
    data = data[1:]
    # index() raises ValueError if no '}' is left, i.e. unbalanced input
    closebrace = data.index('}')
    try:
        openbrace = data.index('{')
    except ValueError:
        openbrace = sys.maxsize
    # A '{' before the next '}' starts a nested clause: recurse into it.
    while openbrace < closebrace:
        clause.append(data[:openbrace])
        data, subclause = parse_clause(data[openbrace:])
        clause.append(subclause)

        closebrace = data.index('}')
        try:
            openbrace = data.index('{')
        except ValueError:
            openbrace = sys.maxsize
    clause.append(data[:closebrace])
    data = data[closebrace+1:]
    return data, clause

parse("ID { foo { bar } }")
parse("ID { foo { bar } } baz { tee fdsa { fdsa } }")

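To be honest, this is a nasty parser. If you wanted to structure it better, you would end up with a proper token stream from a lexer and pass that to the actual parser. As it stands, the "token stream" is just a string that we strip information off the front of.

I'd suggest looking at pyparsing if you want something more sophisticated.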
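For illustration, here is a minimal sketch of that lexer-plus-parser split. It is not the code from this answer: the names `TOKEN`, `tokenize`, `parse_block`, and `parse_statements` are hypothetical, and it assumes an ID is a single run of word characters and that any text between braces is kept as one stripped string.

```python
import re

# Hypothetical token pattern: each brace is its own token,
# any run of non-brace characters is text.
TOKEN = re.compile(r"[{}]|[^{}]+")

def tokenize(source):
    """Yield (kind, value) pairs where kind is 'open', 'close', or 'text'."""
    for match in TOKEN.finditer(source):
        value = match.group()
        if value == "{":
            yield "open", value
        elif value == "}":
            yield "close", value
        elif value.strip():  # skip whitespace-only runs
            yield "text", value.strip()

def parse_block(tokens):
    """Consume tokens up to the matching 'close'; return a nested list."""
    block = []
    for kind, value in tokens:
        if kind == "open":
            block.append(parse_block(tokens))  # recurse for nested braces
        elif kind == "close":
            return block
        else:
            block.append(value)
    raise ValueError("unbalanced braces: missing '}'")

def parse_statements(source):
    """Return a list of (id, nested_block) pairs."""
    tokens = tokenize(source)  # one generator shared by all recursive calls
    statements = []
    for kind, value in tokens:
        if kind != "text" or not re.fullmatch(r"\w+", value):
            raise ValueError("expected an ID, got %r" % value)
        next_kind, _ = next(tokens, ("eof", None))
        if next_kind != "open":
            raise ValueError("expected '{' after ID %r" % value)
        statements.append((value, parse_block(tokens)))
    return statements
```

Because the parser only ever sees tokens, the string slicing and whitespace handling live in one place, and a stray `}` or a missing one surfaces as a `ValueError` rather than a failed `index()` call.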

+0

pyparsing is good even for simple things like this. – 2010-10-27 01:49:34

0

Here's the brute-force method, with error detection either included or indicated:

# parsebrackets.py
def parse_brackets(data):
    # step 1: find the 0-nesting-level { and }
    lpos = []
    rpos = []
    nest = 0
    for i, c in enumerate(data):
        if c == '{':
            if nest == 0:
                lpos.append(i)
            nest += 1
        elif c == '}':
            nest -= 1
            if nest < 0:
                raise Exception('too many } at offset %d' % i)
            if nest == 0:
                rpos.append(i)
    if nest > 0:
        raise Exception('too many { in data')
    prev = -1
    # step 2: extract the pieces
    for start, end in zip(lpos, rpos):
        key = data[prev+1:start].strip()
        # insert test for empty key here
        text = data[start:end+1]
        prev = end
        yield key, text
    if data[prev+1:].strip():
        raise Exception('non-blank text after last }')

Output:

>>> from parsebrackets import parse_brackets as pb 
>>> for k, t in pb(' foo {bar {zot\n}} guff {qwerty}'): 
...     print(repr(k), repr(t))
... 
'foo' '{bar {zot\n}}' 
'guff' '{qwerty}' 
>>> 
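The question asked for a list of strings of the form ID { text }. A minimal sketch of the same depth-counting idea that returns those strings directly and reports line numbers for unbalanced braces might look like the following. The function name `split_entries` is made up for this example, and like the code above it does not check for an empty ID.

```python
def split_entries(data):
    """Return a list of 'ID { text }' strings, one per top-level entry."""
    entries = []
    depth = 0         # current brace nesting level
    start = 0         # offset where the current entry begins
    line = 1          # 1-based line number, used in error messages
    open_line = None  # line of the current entry's top-level '{'
    for i, c in enumerate(data):
        if c == "\n":
            line += 1
        elif c == "{":
            if depth == 0:
                open_line = line
            depth += 1
        elif c == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched '}' on line %d" % line)
            if depth == 0:
                # a top-level '}' closes the entry: slice it out whole
                entries.append(data[start:i + 1].strip())
                start = i + 1
    if depth > 0:
        raise ValueError("'{' opened on line %d is never closed" % open_line)
    if data[start:].strip():
        raise ValueError("text after the last '}'")
    return entries
```

To use it on a file, something like `split_entries(open(path).read())` should do, where `path` is whatever file holds the data. Each returned string keeps its internal line breaks exactly as they appear in the input.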
4

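Using pyparsing you can knock this out in about 6 lines, and then get on with the rest of your work. Here are two variations on a solution, depending on how you want the parsed results structured: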

data = """ID1 { some text } ID2 { some {with some more text nested in braces} text }""" 

from pyparsing import Word, alphas, alphanums, dictOf, nestedExpr, originalTextFor 

# identifier starts with any alpha, followed by any alpha, num, or '_' 
ident = Word(alphas,alphanums+"_") 

# Solution 1 
# list of items is a dict of pairs of idents and nested {}'s 
# - returns {}'s expressions as nested structures 
itemlist = dictOf(ident, nestedExpr("{","}")) 
items = itemlist.parseString(data) 
print(items.dump())

""" 
prints: 
[['ID1', ['some', 'text']], ['ID2', ['some', ['with', 'some', 'more', ... 
- ID1: ['some', 'text'] 
- ID2: ['some', ['with', 'some', 'more', 'text', 'nested', 'in', 'braces'], 'text'] 
""" 

# Solution 2 
# list of items is a dict of pairs of idents and nested {}'s 
# - returns {}'s expressions as strings of text extract from the 
# original input string 
itemlist = dictOf(ident, originalTextFor(nestedExpr("{","}"))) 
items = itemlist.parseString(data) 
print(items.dump())

""" 
prints: 
[['ID1', '{ some text }'], ['ID2', '{ some {with some more text nested in ... 
- ID1: { some text } 
- ID2: { some {with some more text nested in braces} text } 
""" 