Regex with Python: findall inside a boundary

I have a string that looks like the following (the extra spaces are intentional):

"words that don't matter START some words one  some words two  some words three END words that don't matter"

To grab every substring between START and END, i.e. ['some words one', 'some words two', 'some words three'], I wrote the following code:

result = re.search(r'(?<=START).*?(?=END)', string, flags=re.S).group() 
result = re.findall(r'(\(?\w+(?:\s\w+)*\)?)', result)

Is it possible to achieve this with a single regular expression?


2017-09-30 Leandro Ribeiro

With the newer regex module, you can do it in one step:

(?:\G(?!\A)|START)\s*\K 
(?!\bEND\b) 
\w+\s+\w+\s+\w+

This looks complicated, but broken down, it says:

(?:\G(?!\A)|START) # look for START or the end of the last match 
\s*\K    # whitespaces, \K "forgets" all characters to the left 
(?!\bEND\b)   # neg. lookahead, do not overrun END 
\w+\s+\w+\s+\w+  # your original expression

In Python it looks like this:

import regex as re 

rx = re.compile(r''' 
     (?:\G(?!\A)|START)\s*\K 
     (?!\bEND\b) 
     \w+\s+\w+\s+\w+''', re.VERBOSE) 

string = "words that don't matter START some words one  some words two  some words three END words that don't matter" 

print(rx.findall(string)) 
# ['some words one', 'some words two', 'some words three']

See also a demo on regex101.com.


2017-10-01 08:43:48 Jan

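This is what I was looking for: a single-regex solution. This is a fairly new module, right? I didn't know about it. I also need to learn about the IF x THEN | ELSE possibilities in regular expressions.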

@LeandroRibeiro: Indeed. Have a look at https://regexone.com/ and http://rexegg.com/ (quite advanced, but great). – Jan

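I changed your regex a little (https://regex101.com/r/oLFVRk/2/) so that it grabs all the substrings regardless of word count. My example has three-word substrings, but I need it to match an unknown number of words per substring: (?:\G(?!\A)|START)\s*\K (?!\bEND\b) \w+(?:\s\w+)*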
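For reference, here is a minimal sketch of how a variable-word-count pattern like the one in this comment could be run from Python. It assumes the third-party regex package is installed (pip install regex); the names PATTERN and find_between are made up for illustration. One deviation is mine and not part of the comment's pattern: an extra (?!END\b) inside the repeated group, so that a single space before END cannot pull END into the last match. The guarded import only lets the snippet load where the package is missing.

```python
try:
    import regex  # third-party package, not the standard-library re
except ImportError:
    regex = None  # allows the snippet to load where regex is not installed

# Verbose pattern; the (?!END\b) inside the group is an added assumption
PATTERN = r'''
    (?:\G(?!\A)|START)\s*\K     # resume after the previous match, or begin at START
    (?!\bEND\b)                 # stop once END is the next word
    \w+(?:\s(?!END\b)\w+)*      # words joined by one whitespace char, never END itself
'''


def find_between(text):
    """Return runs of single-space-separated words found between START and END."""
    if regex is None:
        raise RuntimeError("the 'regex' package is required for \\G and \\K")
    return regex.findall(PATTERN, text, flags=regex.VERBOSE)


if regex is not None:
    sample = "x START a b  c d e  f END y"
    print(find_between(sample))
```

Substrings are still assumed to be separated by two or more whitespace characters, as in the question's example.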

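In theory, you could wrap your second regex in ()* and put it inside the first one. That would capture every occurrence of the inner expression within the boundaries. Unfortunately, Python's implementation keeps only the last match of a group that matches multiple times. The only implementation I know of that keeps all matches of a group is .NET. So unfortunately this is not a solution for you.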

On the other hand, why can't you simply keep your two-step approach?
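A minimal sketch of that two-step approach wrapped in a helper, using only the standard-library re module. The function name substrings_between and its start/end parameters are hypothetical, and the inner pattern is a simplified version of the one in the question (without the optional parentheses).

```python
import re


def substrings_between(text, start="START", end="END"):
    """Step 1: isolate the text between the markers. Step 2: pull out the word runs."""
    boundary = re.search(re.escape(start) + r'(.*?)' + re.escape(end), text, flags=re.S)
    if boundary is None:
        return []  # markers not found
    # A run is one or more words separated by exactly one whitespace character
    return re.findall(r'\w+(?:\s\w+)*', boundary.group(1))


text = ("words that don't matter START some words one  some words two  "
        "some words three END words that don't matter")
print(substrings_between(text))
# ['some words one', 'some words two', 'some words three']
```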

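Edit: You can compare the behaviour I described using online regex tools.

Pattern: (\w+\s*)* Input: aaa bbb ccc

Try it, for example, with https://pythex.org/ and http://regexstorm.net/tester. You will see that Python returns one match/group, which is ccc, whereas .NET returns three captures for $1: aaa, bbb, ccc.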
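The Python side of that comparison can also be checked locally with the standard-library re module. A small sketch:

```python
import re

# A repeated capturing group: each repetition overwrites the previous capture
m = re.match(r'(\w+\s*)*', 'aaa bbb ccc')

print(m.group(0))  # aaa bbb ccc  (the whole match)
print(m.group(1))  # ccc          (only the last repetition survives)
print(m.groups())  # ('ccc',)
```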

Edit 2: As @Jan said, there is also the newer regex module, which supports multiple captures. I had completely forgotten about that.
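A rough sketch of what multiple captures look like with that module, assuming the third-party regex package is installed (pip install regex). The helper name all_captures is made up for illustration, and the guarded import only lets the snippet load where the package is missing.

```python
try:
    import regex  # third-party package, not the standard-library re
except ImportError:
    regex = None  # allows the snippet to load where regex is not installed


def all_captures(pattern, text, group=1):
    """Return every capture of a repeated group, not just the last one."""
    if regex is None:
        raise RuntimeError("the 'regex' package is required for captures()")
    m = regex.match(pattern, text)
    return m.captures(group) if m else []


if regex is not None:
    # Same pattern as above: the trailing whitespace is part of each capture
    print(all_captures(r'(\w+\s*)*', 'aaa bbb ccc'))    # ['aaa ', 'bbb ', 'ccc']
    # Moving \s* outside the capturing group keeps the words clean
    print(all_captures(r'(?:(\w+)\s*)*', 'aaa bbb ccc'))  # ['aaa', 'bbb', 'ccc']
```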


2017-09-30 23:43:33 PeterE

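"Why can't you simply keep your two-step approach?" I would, but it made me wonder whether a single regex pattern could do it, because I'm trying to learn this as well as I can. Interesting: I just realized I had actually done this in another piece of code: actors = re.findall(r'Actors[\n\r\t]([\w\s\-\'\,]*)[\n\r\t]Stage', crew). That one works, but the material is slightly different, and I can't find a way to make it work with the original example.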

This is an ideal case for re.split, which, as @PeterE mentioned, sidesteps the problem of only the last captured group being accessible.

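import re
# Non-raw string so \' is a plain apostrophe; two or more spaces separate the substrings
s = '"words that don\'t matter START some words one  some words two  some words three END words that don\'t matter" START abc  a bc c END'
# Split on everything that is not a wanted substring, then drop the empty first and last items
parts = re.split(r'^.*?START\s+|\s+END.*?START\s+|\s+END.*?$|\s{2,}', s)[1:-1]
print('\n'.join(parts))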

Because the pattern uses ^ and $, enable the re.MULTILINE/re.M flag when the input spans more than one line.

Output

some words one 
some words two 
some words three 
abc 
a bc c


2017-10-01 01:46:30 kaza

This is elegant. Thank you.
