2017-04-23

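Applying collocations from a list of bigrams using NLTK in Python

I have to find and "apply" collocations in several sentences. The sentences are stored in a list of strings. For now, let's focus on a single sentence. Here is an example: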

sentence = 'I like to eat the ice cream in new york' 

This is what I want to end up with:

sentence_final = 'I like to eat the ice_cream in new_york' 

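I use Python NLTK to find the collocations, and I am able to build a set containing all the collocations found across all my sentences. Here is an example of such a set: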

set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')]) 

In reality the set is obviously much larger.

I created the following function, which is supposed to return the new sentence, modified as described above:

import nltk

def apply_collocations(sentence, set_colloc):
    words = sentence.split()
    # Compare in lowercase, but keep the original casing in the output
    list_bigrams = list(nltk.bigrams(w.lower() for w in words))
    intersect = set(list_bigrams).intersection(set_colloc)
    # No collocation in this sentence
    if not intersect:
        return sentence
    # At least one collocation in this sentence
    new_words = []
    i = 0
    while i < len(words):
        # Join the current word with the next one if they form a collocation
        if i < len(list_bigrams) and list_bigrams[i] in intersect:
            new_words.append(words[i] + '_' + words[i + 1])
            i += 2
        else:
            new_words.append(words[i])
            i += 1
    return ' '.join(new_words)

Two questions:

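  • Is there a more optimized way to do this?
  • Since I am somewhat new to NLTK, can someone tell me whether there is a "direct way" to apply collocations to some text? I mean, once I have identified the bigrams that I consider collocations, is there some function (or fast method) to modify my sentences?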

Answers

You can simply replace the string "x y" with "x_y" for each pair in your collocation set:

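def apply_collocations(sentence, set_colloc):
    # Note: the whole sentence is lowercased before replacing
    res = sentence.lower()
    for b1, b2 in set_colloc:
        # Plain substring replacement of "b1 b2" with "b1_b2"
        res = res.replace("%s %s" % (b1, b2), "%s_%s" % (b1, b2))
    return res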
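
The replacement above lowercases the whole sentence and works on raw substrings, so ('ice', 'cream') would also match inside "nice cream". A possible variant, sketched here with the standard `re` module, matches whole words only and keeps the original casing. The name `apply_collocations_regex` is hypothetical and not part of NLTK or of the original answer; the sketch also assumes each collocation is a pair of plain words.

```python
import re

def apply_collocations_regex(sentence, set_colloc):
    """Join every collocation found in `sentence` with an underscore."""
    # An empty alternation would match everywhere, so bail out early
    if not set_colloc:
        return sentence
    # Build one alternative per pair, e.g. r"ice\s+cream"
    alternatives = (
        r"%s\s+%s" % (re.escape(w1), re.escape(w2))
        for w1, w2 in sorted(set_colloc)
    )
    # \b keeps matches on word boundaries; IGNORECASE mimics the lower() call
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(alternatives), re.IGNORECASE)
    # Replace the whitespace inside each match with a single underscore
    return pattern.sub(lambda m: "_".join(m.group(0).split()), sentence)

sentence = 'I like to eat the ice cream in new york'
set_collocations = {('ice', 'cream'), ('new', 'york'), ('go', 'out')}
print(apply_collocations_regex(sentence, set_collocations))
# prints: I like to eat the ice_cream in new_york
```

When two collocations overlap (for example "new york" and "york city" in "new york city"), this sketch joins only the leftmost one. As for the "direct way" asked about in the question, NLTK also ships `nltk.tokenize.MWETokenizer`, which merges multi-word expressions in a list of tokens using a separator (`_` by default), so it may be worth trying before writing custom code.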