I am learning the naive Bayes classifier with NLTK. How can I add word frequencies to the features for nltk's NaiveBayesClassifier?
In the documentation (http://www.nltk.org/book/ch06.html), section 1.3 "Document Classification", there is a feature-set example:
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
So an element of featuresets looks like ({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg').
But I want to change the dictionary entries from 'contains(waste)': False to 'contains(waste)': 2. I think this form ('contains(waste)': 2) describes a document better, because it captures how often each word occurs. The featuresets would then look like ({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg').
What worries me is that 'contains(waste)': 2 and 'contains(waste)': 1 might be completely different features to NaiveBayesClassifier, so it could not capture the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.
For example, {'contains(lot)': 1, 'contains(waste)': 1} and {'contains(lot)': 2, 'contains(waste)': 1} should be treated as similar documents by the program, not as unrelated ones.
Can nltk's NaiveBayesClassifier understand word frequencies?
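To make this concrete, here is a minimal sketch (not from the NLTK book) of the kind of count-valued feature extractor I have in mind, reusing word_features from the example above; the helper name document_features_with_counts is made up:

from collections import Counter

def document_features_with_counts(document):
    # Sketch only: store raw counts instead of True/False.
    # Note that nltk's NaiveBayesClassifier treats each value (0, 1, 2, ...)
    # as a separate nominal label, not as a magnitude.
    counts = Counter(document)
    features = {}
    for word in word_features:
        features['count({})'.format(word)] = counts[word]
    return features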
Here is the code I am using:
import random
import konlpy
from nltk.classify import naivebayes

def split_and_count_word(data):
    #belongs_to : Main
    #Role : make featuresets from Korean words using konlpy.
    #Parameter : dictionary data (dict of contents, e.g. {'politic': {'parliament': [content, content]}, ...})
    #Return : list of featuresets (e.g. [({'word': True, ...}, 'politic')] == featureset + category)
    featuresets = []
    twitter = konlpy.tag.Twitter()  # Korean word splitter
    for big_cat in data:
        for small_cat in data[big_cat]:
            # save the category name needed in the featuresets
            category = str(big_cat[0:3]) + '/' + str(small_cat)
            count = 0
            print(small_cat)
            for one_news in data[big_cat][small_cat]:
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                # one_news is a list in a list, so open it!
                doc = one_news
                # split words using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])  # delete useless sentences.
                # keep only the split words longer than one character
                list_of_up_two_word = [word for word in list_of_splited_word if len(word) > 1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                # save
                featuresets.append((dict_of_featuresets, category))
    return featuresets
def make_featuresets(data):
    #belongs_to : split_and_count_word
    #Role : make featuresets
    #Parameter : list list_of_up_two_word (e.g. ['비누', '떨어', '지다'])
    #Return : dictionary {word: True for word in data}
    #PROBLEM :(
    #cannot take the frequency of words into account
    return {word: True for word in data}
def naive_train(featuresets):
    #belongs_to : Main
    #Role : learn by the naive Bayes rule
    #Parameter : list featuresets (e.g. [({'word': True, ...}, 'pol/pal')])
    #Return : object classifier (nltk NaiveBayesClassifier object),
    #         list test_set (the featuresets that are randomly selected)
    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)
    return classifier, test_set
featuresets = split_and_count_word(data)
classifier,test_set = naive_train(featuresets)
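One workaround I am considering (just a sketch, not tested on the data above): since NaiveBayesClassifier handles nominal feature values, the raw counts could be collapsed into a few coarse buckets so that similar frequencies at least share a value. The helper name make_featuresets_bucketed is made up:

from collections import Counter

def make_featuresets_bucketed(data):
    # Sketch of a possible replacement for make_featuresets: map each word to a
    # coarse frequency bucket instead of True, so the classifier only ever sees
    # a few nominal values ('1', '2-3', '4+') per word.
    counts = Counter(data)
    def bucket(n):
        if n == 1:
            return '1'
        elif n <= 3:
            return '2-3'
        return '4+'
    return {word: bucket(n) for word, n in counts.items()}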
Thank you for the idea. So you mean I cannot add a word that is already a key in the feature dictionary? For example, the dictionary would have to be {"hello": True, "hello": True, "my": True, ...}. In that case, can you recommend other useful machine learning modules? – dizwe
As you already noted in your comment to @aberger, no, you cannot use the same key twice in a dictionary. I can't point you directly to a solution for your count-valued features, sorry. nltk's [MaxentClassifier](http://www.nltk.org/api/nltk.classify.html#nltk.classify.maxent.MaxentClassifier) uses numeric weights, but they are normally computed by the API from the "nominal" feature values you provide, so you would have to dig around for the right way to use it. Also take a look at scikit-learn. The best classifier depends on your task, so try a few! – alexis
Thank you, I will try that! – dizwe
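A minimal sketch of the scikit-learn route mentioned in the comment above: MultinomialNB is a naive Bayes variant that models word counts directly. The documents and labels below are placeholders, not the Korean news data from the question.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus; in practice feed in the documents and categories from above.
docs = ["waste of time lots of waste", "lots of fun great plot"]
labels = ["neg", "pos"]

# CountVectorizer produces per-word counts, which MultinomialNB uses as frequencies.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["such a waste"]))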