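Splitting a string into outermost parentheses vs. outermost brackets vs. plain text

I need a way, in Python, to take a string of text and split its contents into a list with three kinds of items: outermost parentheses, outermost brackets, and plain text, preserving the original syntax.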

For example, given the string

(([a] b) c) [d] (e) f

the expected output would be this list:

['(([a] b) c)', '[d]', '(e)', ' f']

I tried a few things with regular expressions, such as

\[.+?\]|\(.+?\)|[\w+ ?]+

which gives me

>>> re.findall(r'\[.+?\]|\(.+?\)|[\w+ ?]+', '(([a] b) c) [d] (e) f')
['(([a] b)', ' c', ' ', '[d]', ' ', '(e)', ' f']

(the item c ends up in the wrong list)

I also tried the greedy version of it,

\[.+\]|\(.+\)|[\w+ ?]+

but it falls short when the string contains separate groups of the same kind:

>>> re.findall(r'\[.+\]|\(.+\)|[\w+ ?]+', '(([a] b) c) [d] (e) f') 
['(([a] b) c) [d] (e)', ' f']

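I then moved on from regular expressions to using a stack instead:

def parenthetic_contents(string):
    """Generate (depth, text) pairs for every bracketed group."""
    stack = []
    for i, c in enumerate(string):
        if c == '(' or c == '[':
            stack.append(i)  # remember where this group opened
        elif c == ')' or c == ']':
            start = stack.pop()
            yield (len(stack), string[start:i + 1])

This works great for the parentheses and brackets, except that I have no way of getting the plain text (or do I, and I just don't know about it?):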

>>> list(parenthetic_contents('(([a] b) c) [d] (e) f')) 
[(2, '[a]'), (1, '([a] b)'), (0, '(([a] b) c)'), (0, '[d]'), (0, '(e)')]

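I am not familiar with pyparsing. At first it looked as if nestedExpr() would do the trick, but it takes only one delimiter (() or [], but not both), which does not work for me.

I am all out of ideas now. Any suggestions would be welcome.

Source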

2013-07-05 Minas Abovyan

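As a general rule, regular expressions cannot match your parentheses because they do not "count" (as it were) open and close items. This has to do with [pushdown automata](http://en.wikipedia.org/wiki/Pushdown_automaton) being fundamentally more powerful than [finite state machines](http://en.wikipedia.org/wiki/Finite_automaton). More: http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Pushdown_automaton.html – torek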
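To make the "counting" point concrete, here is a minimal sketch (not part of the comment) of the bookkeeping that a pushdown automaton does and a plain finite-state regex cannot do for unbounded depth. The function name `max_depth` and its parameters are made up for illustration.

```python
def max_depth(text, opens="([", closes=")]"):
    """Return the deepest nesting level in text.

    Raises ValueError if a closer does not match the most recent opener
    or if an opener is never closed.
    """
    stack = []      # closers we are still waiting for, innermost last
    deepest = 0
    for ch in text:
        if ch in opens:
            # Remember which closer must eventually match this opener.
            stack.append(closes[opens.index(ch)])
            deepest = max(deepest, len(stack))
        elif ch in closes:
            if not stack or stack.pop() != ch:
                raise ValueError("unbalanced closer %r" % ch)
    if stack:
        raise ValueError("unclosed opener(s)")
    return deepest


print(max_depth('(([a] b) c) [d] (e) f'))
```

The stack can grow without limit, which is exactly the memory a fixed regular expression lacks.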

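Only very lightly tested (and the output includes white space). As with @Marius's answer (and the general rule about paren matching requiring a PDA), I use a stack. However, I have a little extra paranoia built in to mine.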

def paren_matcher(string, opens, closes):
    """Yield (in order) the parts of a string that are contained
    in matching parentheses. That is, upon encountering an "open
    parenthesis" character (one in <opens>), we require a
    corresponding "close parenthesis" character (the corresponding
    one from <closes>) to close it.

    If there are embedded <open>s they increment the count and
    also require corresponding <close>s. If an <open> is closed
    by the wrong <close>, we raise a ValueError.
    """
    stack = []
    if len(opens) != len(closes):
        raise TypeError("opens and closes must have the same length")
    # could make sure that no closes[i] is present in opens, but
    # won't bother here...

    result = []
    for char in string:
        # If it's an open parenthesis, push corresponding closer onto stack.
        pos = opens.find(char)
        if pos >= 0:
            if result and not stack:  # yield accumulated pre-paren stuff
                yield ''.join(result)
                result = []
            result.append(char)
            stack.append(closes[pos])
            continue
        result.append(char)
        # If it's a close parenthesis, match it up.
        pos = closes.find(char)
        if pos >= 0:
            if not stack or stack[-1] != char:
                raise ValueError("unbalanced parentheses: %s" %
                                 ''.join(result))
            stack.pop()
            if not stack:  # final paren closed
                yield ''.join(result)
                result = []
    if stack:
        raise ValueError("unclosed parentheses: %s" % ''.join(result))
    if result:
        yield ''.join(result)

print(list(paren_matcher('(([a] b) c) [d] (e) f', '([', ')]')))
print(list(paren_matcher('foo (bar (baz))', '(', ')')))
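Since the output of the stack approach includes whitespace-only pieces between groups, one way to get exactly the list from the question is to drop those pieces afterwards. The following is a compact, slice-based sketch of the same stack idea, written to run on its own; `split_top_level` and `drop_blank` are illustrative names, not part of the answer's code.

```python
def split_top_level(text, pairs=(("(", ")"), ("[", "]"))):
    """Split text into top-level bracketed groups and the plain text between them."""
    closer_for = dict(pairs)              # opener -> expected closer
    closers = set(closer_for.values())
    stack = []                            # closers still expected, innermost last
    pieces = []
    start = 0                             # index where the current piece began
    for i, ch in enumerate(text):
        if ch in closer_for:
            if not stack and i > start:
                pieces.append(text[start:i])   # flush plain text before a group
                start = i
            stack.append(closer_for[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                raise ValueError("unbalanced %r at index %d" % (ch, i))
            if not stack:                      # outermost group just closed
                pieces.append(text[start:i + 1])
                start = i + 1
    if stack:
        raise ValueError("unclosed group starting in %r" % text[start:])
    if start < len(text):
        pieces.append(text[start:])            # trailing plain text
    return pieces


def drop_blank(pieces):
    """Remove pieces that consist only of whitespace."""
    return [p for p in pieces if p.strip()]


print(drop_blank(split_top_level('(([a] b) c) [d] (e) f')))
```

Filtering as a separate step keeps the splitter lossless, so joining the unfiltered pieces always reproduces the input.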

Source

2013-07-05 00:43:46 torek

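Thanks! I will need to spend some more time picking it apart to understand how it actually works... but the end result is what I wanted :) –

I managed to do this using a simple parser that uses a level variable to keep track of how deep the stack is.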

import string

def get_string_items(s):
    in_object = False
    level = 0
    current_item = ''
    for char in s:
        # Letters never end an item; just accumulate them.
        if char in string.ascii_letters:
            current_item += char
            continue
        # Outside any bracketed group, spaces are skipped.
        if not in_object:
            if char == ' ':
                continue
        if char in ('(', '['):
            in_object = True
            level += 1
        elif char in (')', ']'):
            level -= 1
        current_item += char
        if level == 0:
            yield current_item
            current_item = ''
            in_object = False
    # Avoid yielding an empty string when the input ends with a group.
    if current_item:
        yield current_item

Output:

list(get_string_items(s)) 
Out[4]: ['(([a] b) c)', '[d]', '(e)', 'f'] 
list(get_string_items('(hi | hello) world')) 
Out[12]: ['(hi | hello)', 'world']

Source

2013-07-05 00:13:45 Marius

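Thanks :) This does seem to work for the most part. It does the separation correctly, but the plain text gets split into individual characters. So "list(get_string_items('(hi | hello) world'))" becomes "['(hi | hello)', 'w', 'o', 'r', 'l', 'd']". I can probably fix that, though. –

Ah yes, sorry, I was too focused on the brackets and forgot about the plain-text case. I think it is a pretty simple fix, editing now – Marius

This script still has a few issues (it eats the spaces in the plain-text category). Torek's code seems to work as expected. I appreciate your answer, though, and I may still use it for other applications ;) –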
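For reference, the level-counter idea can be kept while leaving the spaces in the plain text alone. The sketch below is one possible variant, not Marius's edited code; `split_by_level` is a hypothetical name, and it assumes the input is balanced because a bare counter cannot tell `(` from `[`.

```python
def split_by_level(text):
    """Split text at nesting level 0, keeping plain text (spaces included) intact.

    Assumes balanced input: a counter only tracks depth, so a mismatched
    pair such as "(a]" is not detected.
    """
    items = []
    current = ""
    level = 0
    for ch in text:
        if ch in "([":
            if level == 0 and current:
                items.append(current)   # plain text collected before this group
                current = ""
            level += 1
            current += ch
        elif ch in ")]":
            level -= 1
            current += ch
            if level == 0:              # outermost group is complete
                items.append(current)
                current = ""
        else:
            current += ch
    if current:
        items.append(current)           # trailing plain text
    return items


print(split_by_level('(hi | hello) world'))
```

Compared with a stack, the counter is shorter but gives up the check that each closer matches its opener.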

You can still use nestedExpr; you would create several expressions, one for each kind of delimiter:

from pyparsing import nestedExpr, Word, printables, quotedString, OneOrMore 

parenList = nestedExpr('(', ')') 
brackList = nestedExpr('[', ']') 
printableWord = Word(printables, excludeChars="()[]") 

expr = OneOrMore(parenList | brackList | quotedString | printableWord) 

sample = """(([a] b) c ")") [d] (e) f "(a quoted) [string] with()'s" """ 

import pprint 
pprint.pprint(expr.parseString(sample).asList())

Prints:

[[['[a]', 'b'], 'c', '")"'], 
['d'], 
['e'], 
'f', 
'"(a quoted) [string] with()\'s"']

Note that, by default, nestedExpr returns the parsed content in a nested structure. To preserve the original text, wrap the expressions in originalTextFor:

# preserve nested expressions as their original strings 
from pyparsing import originalTextFor 
parenList = originalTextFor(parenList) 
brackList = originalTextFor(brackList) 

expr = OneOrMore(parenList | brackList | quotedString | printableWord) 

pprint.pprint(expr.parseString(sample).asList())

Prints:

['(([a] b) c ")")', '[d]', '(e)', 'f', '"(a quoted) [string] with()\'s"']

Source

2013-07-06 23:51:46 PaulMcG

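Well, once you give up the idea that parsing nested expressions should work to unlimited depth, you can use regular expressions just fine by specifying a maximum depth in advance. Here is how: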

import re

def nested_matcher(n):
    # poor man's matched paren scanning, gives up after n+1 levels.
    # Matches any string with balanced parens or brackets inside; add
    # the outer parens yourself if needed. Nongreedy. Does not
    # distinguish parens and brackets as that would cause the
    # expression to grow exponentially rather than linearly in size.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n

p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
print(p.findall('(([a] b) c) [d] (e) f'))

This outputs

['(([a] b) c)', ' ', '[d]', ' ', '(e)', ' f']

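This is not quite what you had above, but then your description and example did not really make it clear what you intend to do with the spaces.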
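If the whitespace-only matches are unwanted, one option is to filter them out after matching. The sketch below repeats the `nested_matcher` construction from this answer so that it runs standalone; the wrapper `split_with_depth_limit` and its default of 10 are illustrative assumptions.

```python
import re


def nested_matcher(n):
    # Same construction as in the answer, repeated so this sketch is standalone.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n


def split_with_depth_limit(text, n=10):
    """Split text with a depth-limited regex and drop whitespace-only pieces.

    Groups may contain at most n further levels of nesting; like the
    original pattern, this does not distinguish parens from brackets.
    """
    pattern = re.compile("[^][()]+|[([]" + nested_matcher(n) + "[])]")
    return [piece for piece in pattern.findall(text) if piece.strip()]


print(split_with_depth_limit('(([a] b) c) [d] (e) f'))
```

Note that `re.findall` silently skips characters it cannot match, so input nested deeper than the limit is not reported as an error.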

Source

2014-04-02 17:48:30 user3489112
