
Matching the shortest substring in a list with a for loop: I want to match items from one list (single words) against items in a second list (full sentences). Here is my code:

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)

The problem now is that my code also matches substrings, so when looking for occurrences of 'Python' I also get the sentence with 'Pythons', and when I only want sentences containing the word 'Fun' I also get 'Funny'.

I have already tried adding spaces around the words in the list, but that is not an ideal solution, since the sentences may contain punctuation and then the code returns no match.
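A quick sketch (my addition, not from the question) of why the space-padding approach finds nothing on this data: 'Time' and 'Python' start their sentences and 'Fun' is followed by '!', so the padded token never occurs.

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        if ' ' + word + ' ' in line:  # requires a space on both sides of the token
            print(word, line)
# prints nothing at all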

Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is Nice


'Fun' and 'Funny' are obviously not the same –

Answers


Since you want exact matches, it would be better to use == rather than in:

import string

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        for wrd in line.split():
            if wrd.strip(string.punctuation) == word:  # strip removes punctuation from both ends of wrd
                print(word, line)
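Run against the example data, this should print exactly the three desired pairs, because strip turns 'Fun!' into 'Fun' while 'Pythons' is left intact and therefore never equals 'Python':

Time Time is High
Fun That's Fun!
Python Python is Nice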

It is not that easy (it takes a few more lines of code) to retrieve 'Fun!' for Fun while not retrieving 'Pythons' for Python. It can certainly be done, but at that point the rules are not entirely clear to me. Have a look at this, though:

tokens = ['Time', 'Fun', 'Python'] 
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"] 

print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()]) 
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')] 

Below is the same thing, only this time with a good old for loop instead of a list comprehension. I thought it might help you understand the code above more easily.

a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))
print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]

What happens here is that the code returns the string it found and the phrase it was found in. It does this by splitting each phrase in the sentences list on whitespace. So 'Pythons' and 'Python' are not the same, as you want, but then 'Fun!' and 'Fun' are not the same either. It is also case-sensitive.
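If you do want 'Fun!' to count as a match for 'Fun' (and want to ignore case), one possible extension of the comprehension above (my sketch, not part of the original answer) strips punctuation and lowercases both sides before comparing:

import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

# Compare whole words only, ignoring surrounding punctuation and case.
matches = [(word, phrase)
           for phrase in sentences
           for word in tokens
           if word.lower() in (w.strip(string.punctuation).lower() for w in phrase.split())]
print(matches)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice'), ('Fun', "That's Fun!")]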


You may want to use dynamically generated regular expressions, i.e. for 'Python' the regex would look like r'\bPython\b'. '\b' is a word boundary.

import re

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    regexp = re.compile(r'\b' + word + r'\b')  # raw strings, otherwise '\b' is a backspace character
    for line in sentences:
        if regexp.search(line):  # search() finds the word anywhere in the line, match() only at the start
            print(word, line)
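One small caveat (my addition, not part of the answer): if a token could ever contain characters that are special in regular expressions, pass it through re.escape when building the pattern so it is matched literally. With the tokens in the question this changes nothing, but in general:

import re

word = 'Node.js'  # hypothetical token containing a regex metacharacter
regexp = re.compile(r'\b' + re.escape(word) + r'\b')  # re.escape keeps the '.' literal
print(bool(regexp.search('I write Node.js services')))  # True
print(bool(regexp.search('I write Nodexjs services')))  # False; without re.escape the '.' would match any character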

It is better to tokenize the sentence than to split it on whitespace, because tokenization separates out the punctuation.

For example:

>>> sentence = 'this is a test.'
>>> 'test' in sentence.split(' ')
False
>>> import nltk
>>> nltk.word_tokenize(sentence)
['this', 'is', 'a', 'test', '.']

Code:

import nltk

tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)
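Assuming NLTK's default word_tokenize splits the contractions (so "That's Fun!" becomes ['That', "'s", 'Fun', '!']), this should print:

Time Time is High
Python Python is Nice
Fun That's Fun!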

Why does your code work!? Consider adding some context to the answer. – ppperry