2013-07-05 158 views
1

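I need a way, in Python, to take a string of text and split its contents into a list, distinguishing three kinds of item: outermost brackets vs. outermost parentheses vs. plain text, while preserving the original syntax. In other words, the string should be separated into bracketed content, parenthesized content, and everything else.

For example, given the string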

(([a] b) c) [d] (e) f 

the expected output would be this list:

['(([a] b) c)', '[d]', '(e)', ' f'] 

I tried a few things with regular expressions, such as

\[.+?\]|\(.+?\)|[\w+ ?]+ 

which gives me

>>> re.findall(r'\[.+?\]|\(.+?\)|[\w+ ?]+', '(([a] b) c) [d] (e) f') 
['(([a] b)', ' c', ' ', '[d]', ' ', '(e)', ' f'] 

(note that the c ends up as a separate item instead of inside the first group)

I also tried the greedy version,

\[.+\]|\(.+\)|[\w+ ?]+ 

but it falls short when the string contains separate groups of the same type:

>>> re.findall(r'\[.+\]|\(.+\)|[\w+ ?]+', '(([a] b) c) [d] (e) f') 
['(([a] b) c) [d] (e)', ' f'] 

Then I moved on from regular expressions to using a stack instead:

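def parenthetic_contents(string):
    """Yield (depth, text) for each bracketed group, in the order the groups close."""
    stack = []
    for i, c in enumerate(string):
        if c == '(' or c == '[':
            stack.append(i)    # remember where this group opened
        elif c == ')' or c == ']':
            start = stack.pop()
            yield (len(stack), string[start:i + 1])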

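This works great, except that I have no way of getting the plain text as well as the brackets and parentheses (or do I, and I just don't know about it?):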

>>> list(parenthetic_contents('(([a] b) c) [d] (e) f')) 
[(2, '[a]'), (1, '([a] b)'), (0, '(([a] b) c)'), (0, '[d]'), (0, '(e)')] 

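I'm not familiar with pyparsing. At first it looked as though nestedExpr() would do the trick, but it takes only one delimiter pair (either () or [], but not both), which doesn't work for me.

I'm out of ideas now. Any suggestions are welcome.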

+0

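As a general rule, regular expressions cannot match up your parentheses, because they do not "count" (as it were) the opening and closing items. This comes down to [pushdown automata](http://en.wikipedia.org/wiki/Pushdown_automaton) being fundamentally more powerful than [finite-state machines](http://en.wikipedia.org/wiki/Finite_automaton). More: http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Pushdown_automaton.html – torek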

Answers

1

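Only very lightly tested (and the output includes the whitespace). Like @Marius's answer (and in line with the general rule that paren matching needs a PDA), I use a stack. I have a bit of extra paranoia built in, though.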

def paren_matcher(string, opens, closes):
    """Yield (in order) the parts of a string that are contained
    in matching parentheses. That is, upon encountering an "open
    parenthesis" character (one in <opens>), we require a
    corresponding "close parenthesis" character (the corresponding
    one from <closes>) to close it.

    If there are embedded <open>s they increment the count and
    also require corresponding <close>s. If an <open> is closed
    by the wrong <close>, we raise a ValueError.
    """
    stack = []
    if len(opens) != len(closes):
        raise TypeError("opens and closes must have the same length")
    # could make sure that no closes[i] is present in opens, but
    # won't bother here...

    result = []
    for char in string:
        # If it's an open parenthesis, push corresponding closer onto stack.
        pos = opens.find(char)
        if pos >= 0:
            if result and not stack:  # yield accumulated pre-paren stuff
                yield ''.join(result)
                result = []
            result.append(char)
            stack.append(closes[pos])
            continue
        result.append(char)
        # If it's a close parenthesis, match it up.
        pos = closes.find(char)
        if pos >= 0:
            if not stack or stack[-1] != char:
                raise ValueError("unbalanced parentheses: %s" %
                                 ''.join(result))
            stack.pop()
            if not stack:  # final paren closed
                yield ''.join(result)
                result = []
    if stack:
        raise ValueError("unclosed parentheses: %s" % ''.join(result))
    if result:
        yield ''.join(result)

print(list(paren_matcher('(([a] b) c) [d] (e) f', '([', ')]')))
print(list(paren_matcher('foo (bar (baz))', '(', ')')))
+0

Thanks! I'll need to spend more time picking it apart to understand how it actually works... but the end result is what I wanted :) –
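
For anyone picking the function apart: the "extra paranoia" is the stack of expected closers, which is what lets a wrong-type closer be detected. A minimal sketch of just that check, separate from the splitting logic, might look like the following. The `first_mismatch` name and its `pairs` argument are made up for illustration and are not part of the answer above.

```python
def first_mismatch(text, pairs=None):
    """Return the index of the first bracket error in text, or None if balanced.

    `pairs` maps each opener to its closer (hypothetical helper, for
    illustration only).
    """
    if pairs is None:
        pairs = {'(': ')', '[': ']'}
    closers = set(pairs.values())
    stack = []  # each entry: (expected closer, index of the opener)
    for i, ch in enumerate(text):
        if ch in pairs:
            # Remember which closer this opener needs.
            stack.append((pairs[ch], i))
        elif ch in closers:
            if not stack or stack[-1][0] != ch:
                return i  # stray closer, or closer of the wrong type
            stack.pop()
    if stack:
        return stack[0][1]  # earliest opener that was never closed
    return None


print(first_mismatch('(([a] b) c) [d] (e) f'))  # None
print(first_mismatch('(a]'))                    # 2
print(first_mismatch('((a)'))                   # 0
```

A plain depth counter would accept `(a]`, since it only tracks how many groups are open; storing the expected closer is what makes the type check possible.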

1

I managed to do this with a simple parser that uses a level variable to track how deep the stack is.

import string

def get_string_items(s):
    in_object = False   # True while inside a bracketed group
    level = 0           # current nesting depth
    current_item = ''
    for char in s:
        if char in string.ascii_letters:
            current_item += char
            continue
        if not in_object:
            if char == ' ':
                continue    # spaces outside brackets are dropped
        if char in ('(', '['):
            in_object = True
            level += 1
        elif char in (')', ']'):
            level -= 1
        current_item += char
        if level == 0:
            yield current_item
            current_item = ''
            in_object = False
    if current_item:    # avoid a trailing '' when the input ends on a closer
        yield current_item

Output:

list(get_string_items(s)) 
Out[4]: ['(([a] b) c)', '[d]', '(e)', 'f'] 
list(get_string_items('(hi | hello) world')) 
Out[12]: ['(hi | hello)', 'world'] 
+0

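Thanks :) This does seem to work for the most part. It separates the groups correctly, but the plain text gets split into individual characters. So "list(get_string_items('(hi | hello) world'))" becomes "['(hi | hello)', 'w', 'o', 'r', 'l', 'd']". I can probably fix that, though. –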

+0

Ah yes, sorry, I was too focused on the brackets and forgot about the plain-text case. I think it's a very simple fix, editing now – Marius

+0

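This script still has some problems (it eats the spaces in the plain-text category). Torek's code seems to work as expected. I do appreciate your answer, though; I may still be able to use it for other applications ;) –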
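
For reference, the level-counter idea can be adapted so that plain text is kept whole, spaces included. This is a minimal sketch, assuming it is acceptable to slice the original string rather than build each item character by character. `split_top_level` is a hypothetical name, and unlike the stack version it only counts depth, so it will not notice a `(` closed by a `]`.

```python
def split_top_level(text, opens='([', closes=')]'):
    """Split text into top-level bracketed groups and the plain text between them."""
    parts = []
    depth = 0   # how many groups are currently open
    start = 0   # index where the current piece begins
    for i, ch in enumerate(text):
        if ch in opens:
            if depth == 0 and i > start:
                parts.append(text[start:i])  # flush plain text before the group
                start = i
            depth += 1
        elif ch in closes:
            depth -= 1
            if depth < 0:
                raise ValueError("unexpected closer at index %d" % i)
            if depth == 0:
                parts.append(text[start:i + 1])  # a complete top-level group
                start = i + 1
    if depth:
        raise ValueError("unclosed bracket")
    if start < len(text):
        parts.append(text[start:])  # trailing plain text
    return parts


pieces = split_top_level('(([a] b) c) [d] (e) f')
print(pieces)
# ['(([a] b) c)', ' ', '[d]', ' ', '(e)', ' f']

# Dropping whitespace-only pieces gives the list asked for in the question.
print([p for p in pieces if p.strip()])
# ['(([a] b) c)', '[d]', '(e)', ' f']
```

Slicing by index means nothing inside or outside a group is altered, so digits, punctuation and spaces in the plain text all survive.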

1

You can still use nestedExpr; you just need to create several expressions, one for each kind of delimiter:

from pyparsing import nestedExpr, Word, printables, quotedString, OneOrMore 

parenList = nestedExpr('(', ')') 
brackList = nestedExpr('[', ']') 
printableWord = Word(printables, excludeChars="()[]") 

expr = OneOrMore(parenList | brackList | quotedString | printableWord) 

sample = """(([a] b) c ")") [d] (e) f "(a quoted) [string] with()'s" """ 

import pprint 
pprint.pprint(expr.parseString(sample).asList()) 

Prints:

[[['[a]', 'b'], 'c', '")"'], 
['d'], 
['e'], 
'f', 
'"(a quoted) [string] with()\'s"'] 

Note that by default nestedExpr returns the parsed content as a nested structure. To preserve the original text, wrap the expressions in originalTextFor:

# preserve nested expressions as their original strings 
from pyparsing import originalTextFor 
parenList = originalTextFor(parenList) 
brackList = originalTextFor(brackList) 

expr = OneOrMore(parenList | brackList | quotedString | printableWord) 

pprint.pprint(expr.parseString(sample).asList()) 

Prints:

['(([a] b) c ")")', '[d]', '(e)', 'f', '"(a quoted) [string] with()\'s"'] 
0

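Well, once you give up on the idea that parsing nested expressions should work to unlimited depth, you can use regular expressions just fine by specifying a maximum depth in advance. Here's how: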

def nested_matcher(n):
    # poor man's matched paren scanning, gives up after n+1 levels. 
    # Matches any string with balanced parens or brackets inside; add 
    # the outer parens yourself if needed. Nongreedy. Does not 
    # distinguish parens and brackets as that would cause the 
    # expression to grow exponentially rather than linearly in size. 
    return "[^][()]*?(?:[([]"*n+"[^][()]*?"+"[])][^][()]*?)*?"*n 

import re 

p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]') 
print(p.findall('(([a] b) c) [d] (e) f'))

This will output

['(([a] b) c)', ' ', '[d]', ' ', '(e)', ' f'] 

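This isn't quite what you asked for above, but then your description and example didn't really make clear what you intend to do with the spaces.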
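
The two caveats in the code comment (no distinction between parens and brackets, and a fixed depth limit) are easy to see in practice. The sketch below reuses the same pattern builder. `make_pattern` is a hypothetical wrapper added only for this illustration, and filtering out whitespace-only tokens is one possible reading of what the question wants, not something the question states.

```python
import re


def nested_matcher(n):
    # Same pattern builder as in the answer above.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n


def make_pattern(depth):
    # Hypothetical helper: plain text, or one group nested up to depth+1 levels.
    return re.compile('[^][()]+|[([]' + nested_matcher(depth) + '[])]')


deep = make_pattern(10)

# Dropping whitespace-only tokens reproduces the list from the question.
tokens = deep.findall('(([a] b) c) [d] (e) f')
kept = [t for t in tokens if t.strip()]
print(kept)  # ['(([a] b) c)', '[d]', '(e)', ' f']

# Caveat 1: bracket types are not checked, so a mismatched pair still matches.
mixed = deep.findall('(a]')
print(mixed)  # ['(a]']

# Caveat 2: beyond the depth limit the outer layer is silently dropped
# instead of raising an error.
shallow = make_pattern(1)
too_deep = shallow.findall('(((a)))')
print(too_deep)  # ['((a))']
```

Because `findall` skips whatever it cannot match, exceeding the depth limit produces no error, so input that may be deeply nested or unbalanced is better served by one of the stack-based answers.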