Why is this not a fixed-width pattern?

I am trying to split English sentences correctly, and I came up with the following unholy regex:

(?<!\d|([A-Z]\.)|(\.[a-z]\.)|(\.\.\.)|etc\.|[Pp]rof\.|[Dd]r\.|[Mm]rs\.|[Mm]s\.|[Mm]z\.|[Mm]me\.)(?<=([\.!?])|(?<=([\.!?][\'\"])))[\s]+?(?=[\S])'

The problem is that Python raises the following error:


Traceback (most recent call last): 
    File "", line 1, in 
    File "sp.py", line 55, in analyze 
    self.sentences = re.split(god_awful_regex, self.inputstr.strip()) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 165, in split 
    return _compile(pattern, 0).split(string, maxsplit) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile 
    raise error, v # invalid expression 
sre_constants.error: look-behind requires fixed-width pattern

Why is this not a valid, fixed-width regex? I am not using any repetition characters (* or +), only |.

EDIT: @Anomie solved the problem - thanks a ton! Unfortunately, I cannot get the new expression to balance:

(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])

is what I have now. The number of ('s matches the number of )'s, though:

>>> god_awful_regex = r'''(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])''' 
>>> god_awful_regex.count('(') 
17 
>>> god_awful_regex.count(')') 
17 
>>> god_awful_regex.count('[') 
13 
>>> god_awful_regex.count(']') 
13

Any more ideas?

2011-03-16 Peter

I don't know, but maybe it is because [Pp]rof = 4 characters, while [Mm]rs = 3 characters? – orlp 2011-03-16 23:53:05

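About the unbalanced parentheses: at first glance, the problem seems to be that at the end of your regex you mistakenly escaped the closing bracket of a character class, so the parentheses after it become part of the class instead of doing their actual job. In other places you also escape more than necessary. Try this: r'''(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[.!?])|(?<=[.!?]['"]))[\s]+?(?=[\S])''' – 2011-03-17 07:35:54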
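To illustrate the point about the escaped bracket, here is a minimal sketch that compiles only the tail of the expression, once as posted and once with the closing bracket left unescaped. The helper name compiles and the variable names are made up for this illustration; the exact error message differs between Python versions, so the sketch only checks whether compilation succeeds.

```python
import re
import warnings


def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    with warnings.catch_warnings():
        # Newer Pythons may warn about the stray '[' that ends up inside the set.
        warnings.simplefilter("ignore")
        try:
            re.compile(pattern)
            return True
        except re.error:
            return False


# Tail of the expression from the question: "\]" is a literal ']', so the
# character class runs on and swallows the two ')' meant to close the groups.
broken_tail = r'''(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])'''

# Same tail with the closing bracket left unescaped.
fixed_tail = r'''(?:(?<=[.!?])|(?<=[.!?]['"]))[\s]+?(?=[\S])'''

broken_ok = compiles(broken_tail)
fixed_ok = compiles(fixed_tail)

# Inside a character class, ')' is just a literal character to match.
literals = re.findall(r"[\]))]", "a)b]")
```

This is also why counting ( and ) in the string does not help: the counts are equal, but two of the closing parentheses sit inside a character class and close nothing.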

Also, you may want to simplify your regex by making it case-insensitive (compile it with the re.I option). – 2011-03-17 07:37:38
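A rough sketch of what the case-insensitive variant could look like. The abbreviation list is shortened and the name sentence_gap and the sample text are invented for this example; it is not the full expression from the question.

```python
import re

# Shortened, illustrative abbreviation list. Each lookbehind still has a
# single fixed width; re.I removes the need for [Pp], [Dd], [Mm] and so on.
sentence_gap = re.compile(
    r"(?<!etc\.)(?<!prof\.)(?<!dr\.)(?<!mrs\.)(?<!ms\.)(?<=[.!?])\s+",
    re.I,
)

parts = sentence_gap.split("DR. Who met dr. No. Mrs. Smith left.")
print(parts)
```

One caveat: under re.I a class such as [A-Z] also matches lowercase letters, so the (?<![A-Z]\.) check from the question would then block a split after any single letter followed by a period and would need separate handling.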

Consider this subexpression:

(?<=([\.!?])|(?<=([\.!?][\'\"])))

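The left side of the | is one character wide, while the right side is zero characters wide. You have the same problem in your larger negative lookbehind, where the alternatives can be 1, 2, 3, 4, or 5 characters wide.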

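Logically, a negative lookbehind of (?<!A|B|C) should be equivalent to a series of lookbehinds (?<!A)(?<!B)(?<!C). A positive lookbehind of (?<=A|B|C) should be equivalent to (?:(?<=A)|(?<=B)|(?<=C)).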

2011-03-17 00:08:29 Anomie

It looks like you may be using a repetition character near the end:

[\s]+?

Unless I am reading it wrong.

UPDATE

Or the vertical bars, as orlp mentioned, which the first answer to this question seems to confirm: determine if regular expression only matches fixed-length strings

2011-03-16 23:53:16 ctcherry

Yes, but since it comes after the lookbehind, it should not affect it. – orlp 2011-03-16 23:53:58

As orlp says, the "OR" vertical bar allows strings of different lengths to match; maybe that matters? – ctcherry 2011-03-17 00:07:38

According to the first answer to this question: http://stackoverflow.com/questions/3627570/determine-if-regular-expression-only-matches-fixed-length-strings the vertical bars may be the culprit. – ctcherry 2011-03-17 00:09:12

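This does not answer your question. However, if you want to split text into sentences, you may want to take a look at nltk, which includes, among many other things, a PunktSentenceTokenizer. Here is some example tokenization: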

""" PunktSentenceTokenizer 

A sentence tokenizer which uses an unsupervised algorithm to build a model 
for abbreviation words, collocations, and words that start sentences; and then 
uses that model to find sentence boundaries. This approach has been shown to 
work well for many European languages. """ 

from nltk.tokenize.punkt import PunktSentenceTokenizer 

tokenizer = PunktSentenceTokenizer() 
print(tokenizer.tokenize(__doc__))

# [' PunktSentenceTokenizer\n\nA sentence tokenizer which uses an unsupervised 
# algorithm to build a model\nfor abbreviation words, collocations, and words 
# that start sentences; and then\nuses that model to find sentence boundaries.', 
# 'This approach has been shown to\nwork well for many European languages. ']

2011-03-17 00:03:29 miku
