2014-01-20

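I'm trying to filter out common words so that I'm left with just the city name. How can I remove common words from a string?

This is what I have: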

import re 
ask = "What's the weather like in Lexington, SC?" 
REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"] 
remove = '|'.join(REMOVE_LIST) 
regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE) 
out = regex.sub("", ask) 

It fails with the error:

nothing to repeat 

Answers

[x for x in ask.split() if x.lower() not in REMOVE_LIST] 
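
The comprehension above compares whole whitespace-separated tokens, so a token such as "weather?" would not match "weather", and the surviving tokens keep their punctuation ("Lexington," and "SC?"). A minimal sketch of one way around that, assuming Python 3 and assuming it is acceptable to strip surrounding punctuation from each token; the helper name filter_words is hypothetical:

```python
import string

REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"]

def filter_words(text, remove_list=REMOVE_LIST):
    """Return the tokens of text that are not in remove_list (case-insensitive)."""
    remove = set(remove_list)
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation so "weather?" compares as "weather".
        bare = token.strip(string.punctuation)
        # Skip tokens that were pure punctuation or that appear in the remove set.
        if bare and bare.lower() not in remove:
            kept.append(bare)
    return kept

print(filter_words("What's the weather like in Lexington, SC?"))
```

Apostrophes inside a word are untouched by strip, so "What's" still compares as "what's".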

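You should escape the strings so that they are matched literally, because some characters have a special meaning in regular expressions (for example, the ? in REMOVE_LIST). An unescaped ? is a quantifier, and with nothing in front of it there is nothing to repeat, which is what the error message says.

Use re.escape to escape such characters: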

>>> import re 
>>> re.escape('?') 
'\\?' 

>>> re.search('?', 'Lexington?') 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "C:\Python27\lib\re.py", line 142, in search 
    return _compile(pattern, flags).search(string) 
    File "C:\Python27\lib\re.py", line 242, in _compile 
    raise error, v # invalid expression 
sre_constants.error: nothing to repeat 
>>> re.search(r'\?', 'Lexington?') 
<_sre.SRE_Match object at 0x0000000002C68100> 
>>> 
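
The transcript above is from Python 2.7. As a rough sketch of the same check in script form (assuming Python 3, where the exception is exposed as re.error; the helper name compiles is hypothetical):

```python
import re

def compiles(pattern):
    """Return True if pattern is a valid regular expression, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for invalid patterns such as a bare "?" with nothing to repeat.
        return False

print(compiles("?"))             # bare quantifier: invalid
print(compiles(re.escape("?")))  # escaped: valid, matches a literal "?"
print(re.search(re.escape("?"), "Lexington?").group())
```

The session below applies the same escaping to every entry of the original REMOVE_LIST.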

>>> import re 
>>> ask = "What's the weather like in Lexington, SC?" 
>>> REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"] 
>>> remove = '|'.join(map(re.escape, REMOVE_LIST)) 
>>> regex = re.compile(r'\b(' + remove + r')\b', flags=re.IGNORECASE) 
>>> out = regex.sub("", ask) 
>>> print out 
    Lexington, SC? 
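
Two details of the output above are worth noting: the removed words leave their surrounding spaces behind, and the trailing ? survives because \b cannot match between ? and the end of the string (neither side is a word character). A minimal sketch that wraps the same approach in a function and collapses the leftover whitespace, assuming Python 3; the name remove_words is hypothetical:

```python
import re

REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"]

def remove_words(text, words=REMOVE_LIST):
    """Remove whole-word occurrences of words from text, ignoring case."""
    # Escape each entry so characters such as "?" are matched literally.
    pattern = "|".join(map(re.escape, words))
    regex = re.compile(r"\b(?:" + pattern + r")\b", flags=re.IGNORECASE)
    stripped = regex.sub("", text)
    # Collapse the runs of spaces left behind by the removed words.
    return " ".join(stripped.split())

print(remove_words("What's the weather like in Lexington, SC?"))
```

If the trailing punctuation should go as well, it would need to be handled separately from the \b-delimited words, for example by stripping it after the substitution.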

Use a regular expression to find the words:

import re 

sentence = "What's the weather like in Lexington, SC?" 
words = re.findall(r"[\w']+", sentence.lower()) 
remove = {"like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"} 

print(set(words) - remove)  # parenthesized form works in both Python 2 and 3

Sets are unordered, so if order matters, you can filter the list of words with a list comprehension instead:

[word for word in words if word not in remove]
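
Because the words above are lowercased before filtering, the result comes back as "lexington" and "sc". A small sketch of a variant that compares in lowercase but keeps the original casing, assuming Python 3; the function name keep_words is hypothetical, and "?" is left out of the set because the [\w']+ pattern never produces it:

```python
import re

REMOVE = {"like", "in", "how's", "hows", "weather", "the", "whats", "what's"}

def keep_words(sentence, remove=REMOVE):
    """Return the words of sentence, in order, that are not in remove."""
    # Tokenize on runs of word characters and apostrophes.
    words = re.findall(r"[\w']+", sentence)
    # Compare in lowercase but keep each word's original casing.
    return [word for word in words if word.lower() not in remove]

print(" ".join(keep_words("What's the weather like in Lexington, SC?")))
```

The comma between the city and the state is dropped by the tokenizer, so this form suits cases where only the words themselves are needed.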