2014-01-16 89 views

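NLTK collocations for specific words

I know how to get bigram and trigram collocations using NLTK and apply them to my own corpora. The code is below.

However, I am not sure (1) how to get the collocations for a specific word, and (2) whether NLTK has a collocation measure based on the log-likelihood ratio.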

import nltk 
from nltk.collocations import TrigramCollocationFinder
from nltk.tokenize import word_tokenize 

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence" 

trigram_measures = nltk.collocations.TrigramAssocMeasures() 
finder = TrigramCollocationFinder.from_words(word_tokenize(text)) 

# Print each trigram with its PMI score, highest score first
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)

Answers


Try this code:

import nltk 
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures() 
trigram_measures = nltk.collocations.TrigramAssocMeasures() 

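# Filter function: returns True for n-grams that do NOT contain 'creature';
# apply_ngram_filter removes every n-gram for which the function returns True
creature_filter = lambda *w: 'creature' not in w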


## Bigrams 
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt')) 
# only bigrams that appear 3+ times 
finder.apply_freq_filter(3) 
# only bigrams that contain 'creature' 
finder.apply_ngram_filter(creature_filter) 
# return the 10 n-grams with the highest likelihood ratio
print(finder.nbest(bigram_measures.likelihood_ratio, 10))


## Trigrams 
finder = TrigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt')) 
# only trigrams that appear 3+ times 
finder.apply_freq_filter(3) 
# only trigrams that contain 'creature' 
finder.apply_ngram_filter(creature_filter) 
# return the 10 n-grams with the highest likelihood ratio
print(finder.nbest(trigram_measures.likelihood_ratio, 10))

It uses the likelihood-ratio measure and filters out the n-grams that do not contain the word "creature".
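For a quick check that needs no corpus download, the same filter can be tried on the toy sentence from the question. This is only a minimal sketch: the target word "sheep" is an arbitrary choice for illustration, and `str.split()` is swapped in for `word_tokenize` so that no tokenizer data is required.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy sentence from the question; split() is used here instead of
# word_tokenize (an assumption, to avoid needing tokenizer data).
text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()

target_word = "sheep"  # arbitrary word chosen for illustration

finder = BigramCollocationFinder.from_words(tokens)
# apply_ngram_filter drops every n-gram for which the function returns True,
# so returning True when the target is absent keeps only n-grams containing it.
finder.apply_ngram_filter(lambda *words: target_word not in words)

# score_ngrams returns (ngram, score) pairs sorted by score, highest first.
scored = finder.score_ngrams(BigramAssocMeasures().likelihood_ratio)
for ngram, score in scored:
    print(ngram, score)
```

Every bigram that survives the filter contains the target word, and each one is scored with the log-likelihood ratio, which covers both parts of the question.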

For question 1, try:

target_word = "electronic"  # your choice of word
# Remove every trigram that does not contain target_word
finder.apply_ngram_filter(lambda w1, w2, w3: target_word not in (w1, w2, w3))
for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
    print(i)

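The idea is to filter out what you do not want. This approach is generally used to filter on words in a specific position of the n-gram, and you can adapt it to your own data.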
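A position-specific filter could look like the sketch below, which keeps only trigrams whose first word is the target. The toy sentence is the one from the question; the target word "foo" and the use of `str.split()` instead of `word_tokenize` are assumptions made for illustration.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Toy sentence from the question, tokenized with split() for simplicity.
text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()

target_word = "foo"  # arbitrary word chosen for illustration

finder = TrigramCollocationFinder.from_words(tokens)
# The filter sees the three words separately, so it can test one position:
# here every trigram whose FIRST word is not the target is removed.
finder.apply_ngram_filter(lambda w1, w2, w3: w1 != target_word)

scored = finder.score_ngrams(TrigramAssocMeasures().likelihood_ratio)
for ngram, score in scored:
    print(ngram, score)
```

Testing `w2` or `w3` instead restricts the target to the middle or last position, and conditions can be combined in the same lambda.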