2016-10-21 248 views
3

Python - Iterate over a list of strings and group partially matching strings

I have a list of strings like this:

list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"] 

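How can I iterate over the list and group the partially matching strings without being given any keywords? The result should look like this: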

list 1 = [["I love cat","I love dog","I love fish"],["I hate banana","I hate apple","I hate orange"]] 

Thanks a lot.

+0

What have you tried so far? Some starter code that shows what you attempted and where you got stuck helps others build an answer. – TheF1rstPancake

+0

[`itertools.groupby`](https://docs.python.org/2/library/itertools.html#itertools.groupby) would help with this. – RoadRunner

+0

How do you define a partial match? – wwii

Answers

0

Avoid using words like list when naming variables, since that shadows the built-in. Also, list 1 is not a valid Python variable name.

Try this:

from itertools import groupby 

# Assuming you group by the first two words in each string, e.g. 'I love', 'I hate'. 

L = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"] 

L = sorted(L) 

result = [] 

# groupby only merges adjacent items, which is why L is sorted above
for key, group in groupby(L, lambda x: ' '.join(x.split(' ')[:2])): 
    result.append(list(group)) 

print(result) 
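
If the original order matters, a dictionary keyed on the leading words avoids the sort altogether. This is only a minimal sketch: `group_by_prefix` and its `n_words` parameter are names made up for illustration, and it assumes Python 3.7+ so that dictionaries keep insertion order.

```python
from collections import defaultdict

def group_by_prefix(sentences, n_words=2):
    """Group sentences by their first n_words words, keeping input order."""
    groups = defaultdict(list)
    for sentence in sentences:
        # The key is the tuple of leading words, e.g. ('I', 'love')
        key = tuple(sentence.split()[:n_words])
        groups[key].append(sentence)
    return list(groups.values())

sentences = ["I love cat", "I love dog", "I love fish",
             "I hate banana", "I hate apple", "I hate orange"]
print(group_by_prefix(sentences))
# [['I love cat', 'I love dog', 'I love fish'], ['I hate banana', 'I hate apple', 'I hate orange']]
```

Unlike the groupby version, the groups come out in the order their prefixes first appear in the input.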
+2

`sorted` returns a value, but you aren't assigning it to anything. Maybe use list.sort() instead to sort in place. – wwii

0

You can try this approach. It may not be the best way, but it helps to understand the problem in a more structured way.

from itertools import groupby 

my_list = ["I love cat","I love dog","I love fish","I hate banana","I hate apple","I hate orange"] 

each_word = sorted([x.split() for x in my_list]) 

# I assumed the keywords would be everything except the last word 
grouped = [list(value) for key, value in groupby(each_word, lambda x: x[:-1])] 

result = [] 
for group in grouped: 
    # Join each list of words back into a sentence 
    result.append([" ".join(words) for words in group]) 

print(result) 

Output:

[['I hate apple', 'I hate banana', 'I hate orange'], ['I love cat', 'I love dog', 'I love fish']] 
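
The same idea can probably be written more compactly by using the "everything except the last word" rule directly as the key for both the sort and the grouping. A sketch only: `prefix` is a hypothetical helper name, not part of the answer above.

```python
from itertools import groupby

def prefix(sentence):
    # Everything except the last word, e.g. "I love cat" -> "I love"
    return sentence.rsplit(" ", 1)[0]

my_list = ["I love cat", "I love dog", "I love fish",
           "I hate banana", "I hate apple", "I hate orange"]

# Sort and group with the same key so that equal prefixes are adjacent
result = [list(group) for _, group in groupby(sorted(my_list, key=prefix), key=prefix)]
print(result)
```

Because the sort is stable and only looks at the prefix, the sentences inside each group keep their original relative order instead of being sorted alphabetically.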
+0

You should make sure the iterable is sorted before using itertools.groupby(). – wwii

+0

Yes, that's true @wwii. Thanks for the suggestion, I'll fix that. I also realized that half of the code is unnecessary and could be improved. – RoadRunner

+0

Also, what do you consider a partial match? – RoadRunner

3

Try building an inverted index, and then you can pick whichever keywords you like. This approach ignores word order:

index = {} 
for sentence in sentence_list: 
    for word in set(sentence.split()): 
        index.setdefault(word, set()).add(sentence) 

Or this approach, which keys the index on every possible whole-word phrase prefix:

index = {} 
for sentence in sentence_list: 
    number_of_words = len(sentence.split()) 
    # i = 1 strips the last word, i = 2 the last two, and so on 
    for i in range(1, number_of_words): 
        key_phrase = sentence.rsplit(maxsplit=i)[0] 
        index.setdefault(key_phrase, set()).add(sentence) 

Then, if you want to find all the sentences that contain a keyword (or that start with a phrase, if that is what your index holds):

match_sentences = index[key_term] 

Or a given set of keywords:

from functools import reduce 

# reduce takes the function first, then the iterable, then the initial value 
matching_sentences = reduce(lambda x, y: x & index[y], list_of_keywords[1:], index[list_of_keywords[0]]) 
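
To make the keyword lookup concrete, here is a small end-to-end sketch on the question's data. `sentences_with_all` is a hypothetical helper name, and falling back to an empty set for unknown keywords is my own assumption rather than part of the approach above.

```python
from functools import reduce

sentence_list = ["I love cat", "I love dog", "I love fish",
                 "I hate banana", "I hate apple", "I hate orange"]

# Inverted index: word -> set of sentences containing that word
index = {}
for sentence in sentence_list:
    for word in set(sentence.split()):
        index.setdefault(word, set()).add(sentence)

def sentences_with_all(keywords):
    """Return the sentences that contain every keyword (hypothetical helper)."""
    # Unknown keywords map to an empty set, so the intersection becomes empty
    sets = [index.get(word, set()) for word in keywords]
    return reduce(lambda x, y: x & y, sets) if sets else set()

print(sorted(sentences_with_all(["I", "hate"])))
# ['I hate apple', 'I hate banana', 'I hate orange']
```

The result is a set, so sort it (or otherwise order it) if the order of the sentences matters.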

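Now you can produce lists grouped by almost any term or phrase by writing list comprehensions that pull sentences out of these indexes. For example, if you built the phrase-prefix index and want to group by the first two words of each phrase: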

return [list(index[k]) for k in index if len(k.split()) == 2] 
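
Put together on the question's data, the prefix index might look like the sketch below. It builds the keys with `" ".join(words[:i])`, which I'm assuming is equivalent to the `rsplit` version for single-space-separated sentences, and `group_by_prefix_length` is a made-up name for illustration.

```python
sentence_list = ["I love cat", "I love dog", "I love fish",
                 "I hate banana", "I hate apple", "I hate orange"]

# Prefix index: every proper whole-word prefix -> set of sentences starting with it
index = {}
for sentence in sentence_list:
    words = sentence.split()
    for i in range(1, len(words)):
        key_phrase = " ".join(words[:i])
        index.setdefault(key_phrase, set()).add(sentence)

def group_by_prefix_length(index, n_words):
    """Return one sorted group per indexed prefix that is n_words long."""
    return [sorted(index[k]) for k in index if len(k.split()) == n_words]

groups = group_by_prefix_length(index, 2)
print(groups)
# [['I love cat', 'I love dog', 'I love fish'], ['I hate apple', 'I hate banana', 'I hate orange']]
```

Each group is sorted here because sets have no guaranteed order.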
1

SequenceMatcher will do the job for you. Tune the score threshold to get better results.

Try this:

from difflib import SequenceMatcher 

sentence_list = ["I love cat", "I love dog", "I love fish", "I hate banana", "I hate apple", "I hate orange"] 
result = [] 
for sentence in sentence_list: 
    for group in result: 
        # Compare against the first sentence of each existing group 
        score = SequenceMatcher(None, sentence, group[0]).ratio() 
        if score >= 0.5: 
            # Similar enough: join this group, skipping exact duplicates 
            if score != 1: 
                group.append(sentence) 
            break 
    else: 
        # No existing group matched (or result is still empty) 
        result.append([sentence]) 

print(result) 

Output:

[['I love cat', 'I love dog', 'I love fish'], ['I hate banana', 'I hate apple', 'I hate orange']]
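
When tuning the threshold, it can help to look at the raw scores first. The sketch below just prints the ratio of each sentence against one reference sentence; `similarity` is a hypothetical wrapper around `SequenceMatcher.ratio()`, and a sensible cutoff for other data will likely differ from 0.5.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

reference = "I love cat"
candidates = ["I love dog", "I love fish",
              "I hate banana", "I hate apple", "I hate orange"]

# Print one score per candidate to see where a cutoff would separate the groups
for candidate in candidates:
    print(f"{similarity(candidate, reference):.2f}  {candidate}")
```

On this data the "I love" sentences score above 0.5 against the reference and the "I hate" sentences score below it, which is why a 0.5 cutoff happens to separate the two groups.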