2014-01-08 48 views
1

Formatting a regular expression with words in Python

I have a list of words:

wordlist = ['hypothesis' , 'test' , 'results' , 'total'] 

I also have a sentence:

sentence = "These tests will benefit in the long run." 

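I want to check whether any of the words in wordlist appear in the sentence. I know you can check whether they occur as substrings of the sentence using: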

for word in wordlist:
    if word in sentence:
        print(word)

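However, with substring matching I start matching words that are not in wordlist. For example, here test shows up as a substring even though the word in the sentence is tests. I could solve this with regular expressions, but is it possible to build the regular expression by formatting each new word into it? That is, if I want to check whether the word is in the sentence: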

for some_word_goes_in_here in wordlist:
    if re.search('.*(some_word_goes_in_here).*', sentence):
        print(some_word_goes_in_here)

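In this case the regular expression treats some_word_goes_in_here as the literal pattern to search for, not as the value of the variable some_word_goes_in_here. Is there a way to format the value of some_word_goes_in_here into the pattern, so that the regular expression searches for that value?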

+0

If you have a better solution, I am eager to hear it. – kolonel

Answers

1

Try using:

if re.search(r'\b' + word + r'\b', sentence): 

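\b is a word boundary: it matches at the position between a word character and a non-word character, or at the start or end of the string next to a word character (word characters are any letter, digit, or underscore).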

For example:

>>> import re 
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total'] 
>>> sentence = "The total results for the test confirm the hypothesis" 
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print(word)
... 
hypothesis 
test 
results 
total 

With your string:

>>> sentence = "These tests will benefit in the long run." 
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print(word)
... 
>>> 

Nothing is printed.
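
To make the boundary rule above concrete, here is a minimal sketch. The helper name `contains_word` is hypothetical (it is not part of the answer), and it adds `re.escape` as a precaution, which the answer's snippet does not do. It shows how letters, underscores, and punctuation next to the word affect the match.

```python
import re

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word (hypothetical helper)."""
    return re.search(r'\b' + re.escape(word) + r'\b', text) is not None

sentence = "These tests will benefit in the long run."

# 's' is a word character, so there is no boundary after 'test' in 'tests'.
print(contains_word('test', sentence))           # False
# '.' is a non-word character, so a boundary exists after 'run'.
print(contains_word('run', sentence))            # True
# '_' counts as a word character, so 'test_case' does not contain the word 'test'.
print(contains_word('test', 'see test_case'))    # False
# '-' is a non-word character, so 'test-case' does.
print(contains_word('test', 'see test-case'))    # True
```

Note that the match is case-sensitive; passing `re.IGNORECASE` to `re.search` would be one way to relax that, if it suits the data.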

+0

Thanks. Yes, but in this case nothing should match. – kolonel

+1

@kolonel I used a different string, but let me add yours in a moment. – Jerry

+2

Don't use 'list' as a variable name; it shadows the built-in type. –

2

Use \b word boundaries to test for the words:

for word in wordlist:
    if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
        print('{} matched'.format(word))

But you could also just split the sentence into separate words. Turning the word list into a set makes the test more efficient:

words = set(wordlist) 
if words.intersection(sentence.split()):
    # no looping over `words` required.
    pass

Demo:

>>> import re 
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total'] 
>>> sentence = "These tests will benefit in the long run." 
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print('{} matched'.format(word))
... 
>>> words = set(wordlist) 
>>> words.intersection(sentence.split()) 
set([]) 
>>> sentence = 'Lets test this hypothesis that the results total the outcome' 
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print('{} matched'.format(word))
... 
hypothesis matched 
test matched 
results matched 
total matched 
>>> words.intersection(sentence.split()) 
set(['test', 'total', 'hypothesis', 'results']) 
+0

I was considering using 're.escape' and decided against it, because _words_ don't need escaping. In the more general case, it is good advice. – Alfe
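
As a rough illustration of the more general case, here is a sketch in which the search term contains a regex metacharacter. The function `word_pattern` and its `escape` flag are hypothetical names introduced only for this comparison.

```python
import re

def word_pattern(word, escape=True):
    """Build a whole-word pattern; `escape` toggles re.escape for comparison."""
    body = re.escape(word) if escape else word
    return r'\b{}\b'.format(body)

text = 'Version 315 was released.'

# Unescaped, '.' matches any character, so '3.5' wrongly matches '315'.
print(bool(re.search(word_pattern('3.5', escape=False), text)))           # True
# Escaped, '.' only matches a literal dot, so '315' no longer matches.
print(bool(re.search(word_pattern('3.5'), text)))                         # False
# The literal term is still found where it really occurs.
print(bool(re.search(word_pattern('3.5'), 'Version 3.5 was released.')))  # True
```

For plain alphabetic words such as those in wordlist, the escaped and unescaped patterns are identical, which is why escaping is optional there.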

+0

@MartijnPieters Thanks. – kolonel

+0

@MartjinPieters I think splitting the sentence into words could introduce errors, because finding the boundaries between words is not a trivial task. – kolonel
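
One way to see the concern: whitespace splitting leaves punctuation attached to words. The sketch below compares it with a simple `\w+` tokenizer. The sample sentence is made up for illustration, and `\w+` is only one possible tokenization, not a complete solution (it splits contractions such as "don't", for instance).

```python
import re

wordlist = ['hypothesis', 'test', 'results', 'total']
# Illustrative sentence (an assumption, not from the thread) with punctuation after words.
sentence = 'Does the total support the hypothesis? Run the test.'

words = set(wordlist)

# Whitespace splitting keeps punctuation attached: 'hypothesis?' and 'test.'.
by_split = words.intersection(sentence.split())
# \w+ extracts runs of word characters, dropping the punctuation.
by_tokens = words.intersection(re.findall(r'\w+', sentence))

print(sorted(by_split))   # ['total']
print(sorted(by_tokens))  # ['hypothesis', 'test', 'total']
```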

1

I would use something like this:

words = "hypothesis test results total".split() 
# ^^^ but you can use your literal list if you prefer that 
for word in words:
    if re.search(r'\b%s\b' % (word,), sentence):
        print(word)

You can even speed this up by using a single regular expression:

for foundWord in re.findall(r'\b' + r'\b|\b'.join(words) + r'\b', sentence):
    print(foundWord)
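
A possible variant of the single-expression idea, sketched under a few assumptions: the function name `find_words` is hypothetical, the word list is non-empty, and each word is passed through `re.escape`. It wraps the alternatives in one non-capturing group so a single pair of \b anchors covers all of them.

```python
import re

def find_words(words, text):
    """Return the listed words found in `text` as whole words, in order of appearance."""
    # Escape each word, then share one pair of \b anchors via a non-capturing group.
    alternation = '|'.join(re.escape(w) for w in words)
    pattern = re.compile(r'\b(?:' + alternation + r')\b')
    return pattern.findall(text)

words = ['hypothesis', 'test', 'results', 'total']

print(find_words(words, 'Lets test this hypothesis that the results total the outcome'))
# ['test', 'hypothesis', 'results', 'total']
print(find_words(words, 'These tests will benefit in the long run.'))
# []
```

Unlike the per-word loop, the results come back in the order the words occur in the sentence, and a word that appears twice is reported twice.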
+0

Thanks for your solution. – kolonel