Counting n-gram frequency in Python with NLTK

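I have the following code. I know I can use the apply_freq_filter function to filter out collocations that occur fewer than a given number of times. However, before I decide what frequency to set for the filter, I don't know how to get the frequencies of all the n-gram tuples in a document (bigrams, in my case). As you can see, I am using the nltk collocations class.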

import nltk
from nltk.collocations import BigramCollocationFinder

# Read the whole file and split it on whitespace
with open('a_text_file', 'r') as open_file:
    line = open_file.read()
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# Drop bigrams that occur fewer than 3 times
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 100))


2013-01-16 Rkz

Have you tried 'finder.ngram_fd.viewitems()'? –

Thanks, finder.ngram_fd.viewitems() works! – Rkz

The finder.ngram_fd.viewitems() function works.
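
A minimal sketch of the same idea for Python 3, where viewitems() is no longer available on dictionaries and items() (or most_common()) plays that role. The token list is made up for illustration. ngram_fd is the frequency distribution the finder builds over all bigrams, so it can be inspected before apply_freq_filter is called.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Made-up tokens for illustration; in the question they come from a file.
tokens = "the cat sat on the mat and the cat ran".split()

finder = BigramCollocationFinder.from_words(tokens)

# ngram_fd holds the raw count of every bigram, before any filtering.
all_counts = dict(finder.ngram_fd.items())
for bigram, count in finder.ngram_fd.most_common(3):
    print(bigram, count)

# Having looked at the counts, pick a threshold and filter.
finder.apply_freq_filter(2)
kept = finder.nbest(BigramAssocMeasures().pmi, 100)
print(kept)
```

Copying the counts into a plain dict first keeps the unfiltered frequencies around, since the filter discards low-count bigrams from the finder.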


2013-01-21 01:22:09 Rkz

NLTK has its own bigrams generator, as well as a convenient FreqDist() function.

import nltk

with open('a_text_file') as f:
    raw = f.read()

tokens = nltk.word_tokenize(raw)

# Create your bigrams
bgs = nltk.bigrams(tokens)

# Compute the frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k, v in fdist.items():
    print(k, v)

Once you have the bigrams and the frequency distribution, you can filter them according to your needs.

Hope that helps.
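
The filtering step mentioned above could look roughly like this. It is only a sketch: the sample text is made up, min_count is a hypothetical threshold, and split() stands in for word_tokenize so that no tokenizer models are needed.

```python
import nltk

# Made-up text; split() is used instead of word_tokenize here.
tokens = "a b a b a c a b".split()

fdist = nltk.FreqDist(nltk.bigrams(tokens))

# min_count is a hypothetical threshold chosen after inspecting fdist.
min_count = 2
frequent = {bg: n for bg, n in fdist.items() if n >= min_count}
print(frequent)
```

Because the full distribution is still in fdist, the threshold can be changed and the filter re-run without recounting.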


2013-01-19 10:05:38

This gives me 'File "/usr/local/lib/python3.6/site-packages/nltk/util.py", line 467, in ngrams while n > 1: TypeError: '>' not supported between instances of 'str' and 'int'' – m02ph3u5

from nltk import FreqDist
from nltk.util import ngrams

def compute_freq():
    bigramfdist = FreqDist()
    threeramfdist = FreqDist()  # for trigrams; not filled in here

    with open('corpus.txt', 'r') as textfile:
        for line in textfile:
            if len(line) > 1:  # skip empty lines
                tokens = line.strip().split(' ')

                bigrams = ngrams(tokens, 2)
                bigramfdist.update(bigrams)
    return bigramfdist

compute_freq()


2018-03-08 18:02:56 Vahab

Just insert the indentation after the 'if'; the code works with Python 3.5 – Vahab
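
The per-line counting above generalizes to any n. The following is a sketch only: ngram_frequencies is a hypothetical helper, and the sample lines are made up so that no file is needed. Note that n has to be an int; passing it as a string is one way to get a TypeError from ngrams, though the cause of the error quoted in the earlier comment is not stated.

```python
from nltk import FreqDist
from nltk.util import ngrams

def ngram_frequencies(lines, n):
    """Count n-grams line by line (hypothetical helper; n must be an int)."""
    fdist = FreqDist()
    for line in lines:
        tokens = line.split()
        # n-grams never span a line break because each line is counted alone.
        fdist.update(ngrams(tokens, n))
    return fdist

# Made-up sample lines for illustration.
lines = ["to be or not to be", "to be is to do"]
bigram_counts = ngram_frequencies(lines, 2)
trigram_counts = ngram_frequencies(lines, 3)
print(bigram_counts.most_common(2))
```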
