Can a Python regular expression negate a list of words?

I have to match all the alphanumeric words in a text.

>>> import re 
>>> text = "hello world!! how are you?" 
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text) 
>>> final_list 
['hello', 'world', 'how', 'are', 'you'] 
>>>

This works fine, but I also have some words to negate, i.e. words that should not appear in my final list.

>>> negate_words = ['world', 'other', 'words']

A bad way of doing it:

>>> negate_str = '|'.join(negate_words) 
>>> list(filter(lambda x: not re.match(negate_str, x), final_list))
['hello', 'how', 'are', 'you']

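But I could save one loop if my very first regex pattern could be changed to take the negation of those words into account. I found ways to negate characters, but I have whole words to negate. I also found regex approaches in other questions, but they did not help either.

Can this be done with Python's re?

Update

My text can span a few hundred lines. The negate_words list can be lengthy too.

Considering this, is a regex the right tool for such a task in the first place? Any suggestions?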

Source

2011-11-30 simplyharsh

Are there a lot of 'negate_words'? –

@bitsMiz Yes, there can be a lot of negate words. The text can also span quite a few lines. – simplyharsh

I don't think there is a clean way to do this with a regular expression. The closest I could find is a bit ugly and not exactly what you wanted:

>>> re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text) 
['hello', '', 'how', 'are', 'you']
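For comparison, a negative lookahead can exclude the words inside the pattern itself, so no empty strings end up in the result. This is only a sketch and not part of the original answer: the helper name `find_words_except` is hypothetical, and the explicit `[a-zA-Z0-9]` lookarounds are an assumption chosen so that word edges agree with the character class used in the question (`\b` would also treat `_` as a word character).

```python
import re

def find_words_except(text, excluded):
    """Return alphanumeric words in text, skipping any word listed in excluded."""
    # Escape each word so regex metacharacters in the list are taken literally.
    alternatives = "|".join(re.escape(word) for word in excluded)
    pattern = (
        r"(?<![a-zA-Z0-9])"                              # must start at a word edge
        r"(?!(?:" + alternatives + r")(?![a-zA-Z0-9]))"  # reject whole excluded words
        r"[a-zA-Z0-9]+"                                  # the word itself
    )
    return re.findall(pattern, text)

text = "hello world!! how are you?"
negate_words = ['world', 'other', 'words']
print(find_words_except(text, negate_words))             # ['hello', 'how', 'are', 'you']
print(find_words_except("worldly words", negate_words))  # ['worldly']
```

With a very long exclusion list the alternation grows accordingly, so a set lookup after matching may still be the simpler choice.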

Why not use Python's sets instead? They are very fast:

>>> list(set(final_list) - set(negate_words)) 
['hello', 'how', 'are', 'you']

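If order matters, see the reply from @glglgl below. His list comprehension version is very readable. Here is a fast but less readable equivalent using itertools: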

>>> import itertools
>>> negate_words_set = set(negate_words)
>>> list(itertools.filterfalse(negate_words_set.__contains__, final_list))  # ifilterfalse in Python 2
['hello', 'how', 'are', 'you']

Another option is to build up the word list in a single pass using re.finditer:

>>> negate_words_set = set(negate_words)
>>> result = []
>>> for mo in re.finditer(r"[a-zA-Z0-9]+", text):
...     word = mo.group()
...     if word not in negate_words_set:
...         result.append(word)
...
>>> result
['hello', 'how', 'are', 'you']

Source

2011-11-30 09:09:09

It is worth mentioning that word order will be lost. – DrTyrsa

'[i for i in final_list if i not in negate_words_set]' – glglgl

@raymond, ah! Are you sure? Either way, I can definitely replace my filter function with the set you mentioned. – simplyharsh
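The trade-off raised in these comments can be sketched as follows. The example is illustrative only; the sample sentence with repeated words is an assumption added to make the difference visible.

```python
import re

negate_words = ['world', 'other', 'words']
text = "you say hello world and you say goodbye"  # hypothetical sample text
final_list = re.findall(r"[a-zA-Z0-9]+", text)

# Set difference: fast, but word order and duplicates are lost.
unordered = set(final_list) - set(negate_words)

# List comprehension with a set lookup: keeps order and duplicates.
negate_words_set = set(negate_words)
ordered = [w for w in final_list if w not in negate_words_set]

print(sorted(unordered))  # ['and', 'goodbye', 'hello', 'say', 'you']
print(ordered)            # ['you', 'say', 'hello', 'and', 'you', 'say', 'goodbye']
```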

Maybe it is worth trying pyparsing:

>>> from pyparsing import * 

>>> negate_words = ['world', 'other', 'words'] 
>>> parser = OneOrMore(Suppress(oneOf(negate_words))^Word(alphanums)).ignore(CharsNotIn(alphanums)) 
>>> parser.parseString('hello world!! how are you?').asList() 
['hello', 'how', 'are', 'you']

Note that oneOf(negate_words) must come before Word(alphanums) to make sure it is matched first.

Edit: Just for fun, I repeated the exercise using lepl (also an interesting parsing library):

>>> from lepl import * 

>>> negate_words = ['world', 'other', 'words'] 
>>> parser = OneOrMore(~Or(*negate_words) | Word(Letter() | Digit()) | ~Any()) 
>>> parser.parse('hello world!! how are you?') 
['hello', 'how', 'are', 'you']

Source

2011-11-30 09:46:44 jcollado

Don't ask needlessly much of a regex.
Instead, think of generators.

import re 

unwanted = ('world', 'other', 'words') 

text = "hello world!! how are you?" 

gen = (m.group() for m in re.finditer("[a-zA-Z0-9]+",text)) 
li = [ w for w in gen if w not in unwanted ]

A generator can also be created instead of the list li.
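A fully lazy variant of the code above might look like the sketch below. The function name `iter_wanted_words` is hypothetical, and converting `unwanted` to a `frozenset` is an added assumption meant to keep membership tests cheap when the list is long.

```python
import re

def iter_wanted_words(text, unwanted):
    """Lazily yield alphanumeric words from text that are not in unwanted."""
    unwanted_set = frozenset(unwanted)  # constant-time membership on average
    for m in re.finditer(r"[a-zA-Z0-9]+", text):
        word = m.group()
        if word not in unwanted_set:
            yield word

text = "hello world!! how are you?"
words = iter_wanted_words(text, ('world', 'other', 'words'))
print(next(words))  # hello
print(list(words))  # ['how', 'are', 'you']
```

Nothing is scanned until the generator is consumed, which may help when the text spans many lines and only part of the result is needed.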

Source

2011-11-30 14:05:34 eyquem
