
Matching the shortest substring in a list with a for loop: I want to match items from one list (single words) against items in a second list (full sentences). Here is my code:

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)

The problem now is that my code also matches substrings, so when looking for occurrences of 'Python' I also get the sentence with 'Pythons', and when I only want sentences containing the word 'Fun' I also get 'Funny'.

I have already tried adding spaces around the words in the list, but that is not an ideal solution, since the sentences may contain punctuation and then the code returns no match.
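A quick sketch (my addition, not from the question) of why the space-padding approach finds nothing on this data: 'Time' and 'Python' start their sentences and 'Fun' is followed by '!', so the padded token never occurs.

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if ' ' + word + ' ' in line:  # requires a space on both sides of the token
            print(word, line)
# prints nothing at all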

Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is Nice


'Fun' and 'Funny' are obviously not the same –

Answers


Since you want exact matches, it would be better to use == rather than in:

import string

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        for wrd in line.split():
            if wrd.strip(string.punctuation) == word:  # strip removes punctuation from both ends of wrd
                print(word, line)
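Run against the example data, this should print exactly the three desired pairs, because strip turns 'Fun!' into 'Fun' while 'Pythons' is left intact and therefore never equals 'Python':

Time Time is High
Fun That's Fun!
Python Python is Nice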

It is not that easy (it takes a few more lines of code) to retrieve 'Fun!' for Fun while not retrieving 'Pythons' for Python. It can certainly be done, but at that point the rules are not entirely clear to me. Have a look at this, though:

tokens = ['Time', 'Fun', 'Python'] 
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"] 

print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()]) 
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')] 

Below is the same thing, only this time with a good old for loop instead of a list comprehension. I thought it might help you understand the code above more easily.

a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

What happens here is that the code returns the string it found and the phrase it was found in. It does this by splitting each phrase in the sentences list on whitespace. So 'Pythons' and 'Python' are not the same, as you want, but then 'Fun!' and 'Fun' are not the same either. It is also case-sensitive.
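If you do want 'Fun!' to count as a match for 'Fun' (and want to ignore case), one possible extension of the comprehension above (my sketch, not part of the original answer) strips punctuation and lowercases both sides before comparing:

import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

# Compare whole words only, ignoring surrounding punctuation and case.
matches = [(word, phrase)
           for phrase in sentences
           for word in tokens
           if word.lower() in (w.strip(string.punctuation).lower() for w in phrase.split())]
print(matches)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice'), ('Fun', "That's Fun!")]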


You may want to use dynamically generated regular expressions, i.e. for 'Python' the regex would look like r'\bPython\b'. '\b' is a word boundary.

import re

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    regexp = re.compile(r'\b' + word + r'\b')  # raw strings, otherwise '\b' is a backspace character
    for line in sentences:
        if regexp.search(line):  # search() finds the word anywhere in the line, match() only at the start
            print(word, line)
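One small caveat (my addition, not part of the answer): if a token could ever contain characters that are special in regular expressions, pass it through re.escape when building the pattern so it is matched literally. With the tokens in the question this changes nothing, but in general:

import re

word = 'Node.js'  # hypothetical token containing a regex metacharacter
regexp = re.compile(r'\b' + re.escape(word) + r'\b')  # re.escape keeps the '.' literal
print(bool(regexp.search('I write Node.js services')))  # True
print(bool(regexp.search('I write Nodexjs services')))  # False; without re.escape the '.' would match any character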

It is better to tokenize the sentence than to split it on whitespace, because tokenization separates out the punctuation.

For example:

>>> sentence = 'this is a test.'
>>> 'test' in sentence.split(' ')
False
>>> import nltk
>>> nltk.word_tokenize(sentence)
['this', 'is', 'a', 'test', '.']

Code:

import nltk

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)
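Assuming NLTK's default word_tokenize splits the contractions (so "That's Fun!" becomes ['That', "'s", 'Fun', '!']), this should print:

Time Time is High
Python Python is Nice
Fun That's Fun!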

Why does your code work!? Consider adding some context to the answer. – ppperry