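How do I add frequency to the NLTK Naive Bayes classifier?

I am learning about the Naive Bayes classifier by using NLTK. How can I add word frequency to the NLTK NaiveBayesClassifier?

The NLTK book (http://www.nltk.org/book/ch06.html), section 1.3 "Document Classification", gives an example of featuresets.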

import nltk
from nltk.corpus import movie_reviews

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)  # set membership tests are faster than list lookups
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# `documents` is the list of (word list, category) pairs built earlier in the same chapter
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

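So an example of the form of featuresets is [({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg'), ...].

But I want to change the dictionary entries from 'contains(waste)': False to 'contains(waste)': 2. I think that form ('contains(waste)': 2) describes a document well, because it records the frequency of the word. The featuresets would then be [({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg'), ...].

But I am worried that 'contains(waste)': 2 and 'contains(waste)': 1 are completely different things to the NaiveBayesClassifier, so that it cannot see the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.

{'contains(lot)': 1 and 'contains(waste)': 1} and {'contains(waste)': 2 and 'contains(waste)': 1} could look the same to the program.

Can nltk.NaiveBayesClassifier understand the frequency of words?

This is the code I used: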

import random

import konlpy.tag
from nltk.classify import naivebayes


def split_and_count_word(data):
    # belongs_to : Main
    # Role : make featuresets from Korean words using konlpy.
    # Parameter : dictionary data (dict of contents, ex. {'politic': {'parliament': [content, content]}, ...})
    # Return : list featuresets ([({'word': True, ...}, 'politic'), ...] == featureset + category)

    featuresets = []
    twitter = konlpy.tag.Twitter()  # Korean word splitter

    for big_cat in data:
        for small_cat in data[big_cat]:
            # save category name needed in featuresets
            category = str(big_cat[0:3]) + '/' + str(small_cat)
            count = 0
            print(small_cat)

            for one_news in data[big_cat][small_cat]:
                # progress indicator: print the running count every 100 articles
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                # one_news is a list in a list, so open it
                doc = one_news
                # split into morphemes using konlpy, dropping the useless trailing 63 characters
                list_of_splited_word = twitter.morphs(doc[:-63])
                # keep only the words that are at least two characters long
                list_of_up_two_word = [word for word in list_of_splited_word if len(word) > 1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                # save
                featuresets.append((dict_of_featuresets, category))

    return featuresets


def make_featuresets(data):
    # belongs_to : split_and_count_word
    # Role : make featuresets
    # Parameter : list list_of_up_two_word (ex. ['비누', '떨어', '지다'])
    # Return : dictionary {word: True for word in data}

    # PROBLEM :(
    # cannot consider the frequency of a word
    return {word: True for word in data}


def naive_train(featuresets):
    # belongs_to : Main
    # Role : learning by the naive Bayes rule
    # Parameter : list featuresets ([({'word': True, ...}, 'pol/pal'), ...])
    # Return : object classifier (nltk NaiveBayesClassifier object),
    #          list test_set (the featuresets that are randomly selected)

    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)

    return classifier, test_set


featuresets = split_and_count_word(data)
classifier, test_set = naive_train(featuresets)


2016-10-20 dizwe

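NLTK's Naive Bayes classifier treats feature values as logically distinct. Values are not restricted to True and False, but they are never treated as quantities. If you have features f=2 and f=3, they count as distinct values. The only way to add quantity to such a model is to sort the counts into "buckets" such as f=1, f="few" (2-5), f="several" (6-10), f="many" (11 or more). (Note: if you go this route, there are algorithms for choosing good value ranges for the buckets.) Even then, the model does not "know" that "few" lies between "one" and "several". You need a different machine learning tool to handle quantities directly.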
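The bucketing idea could be sketched roughly as follows. This is only an illustration: the helper names `bucket` and `document_features`, the "none" label for a count of zero, and the `count(...)` key format are assumptions, and the thresholds simply copy the example ranges given above.

```python
from collections import Counter


def bucket(count):
    """Map a raw word count to a nominal label (thresholds follow the example ranges)."""
    if count == 0:
        return "none"      # assumption: the zero case is not named in the answer
    if count == 1:
        return "one"
    if count <= 5:
        return "few"       # 2-5
    if count <= 10:
        return "several"   # 6-10
    return "many"          # 11 or more


def document_features(tokens, word_features):
    """Build a feature dict whose values are bucket labels instead of True/False."""
    counts = Counter(tokens)
    return {"count({})".format(w): bucket(counts[w]) for w in word_features}


features = document_features(
    ["waste", "waste", "lot", "of", "time"],
    ["waste", "lot", "good"],
)
print(features)
```

A list of `(features, label)` pairs built this way should be usable with `nltk.NaiveBayesClassifier.train` like any other featureset, since the classifier accepts arbitrary nominal values. It will still treat "few" and "several" as unrelated labels.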


2016-11-13 20:21:43 alexis

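Thank you for giving me the idea. So you mean I can't add a word that is already contained in the feature dictionary? For example, the dictionary would be {"hello": True, "hello": True, "my": True ...}. Then, can you recommend other useful machine learning modules? – dizwe

As you already noted in your comment to @aberger, you cannot use the same key twice in a dictionary. I can't point you directly to a solution that handles quantities, sorry. nltk's MaxentClassifier (http://www.nltk.org/api/nltk.classify.html#nltk.classify.maxent.MaxentClassifier) uses numerical weights, but they are usually set by the API based on the "nominal values" of the features you provide, so you would have to dig around for the right way to use it. Also take a look at scikit-learn. The best classifier depends on your task, so try several! – alexis

Thanks, I'll try! – dizwe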
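For the scikit-learn suggestion, one possible starting point is a multinomial Naive Bayes model, which works with word counts as numbers rather than as nominal labels. The sketch below is not from the thread: the toy training data and the `count_features` helper are made up for illustration, and it assumes scikit-learn is installed.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up training data: (token list, label) pairs.
train_docs = [
    (["waste", "waste", "boring"], "neg"),
    (["great", "fun", "great"], "pos"),
]


def count_features(tokens):
    """Hypothetical helper: map each word to how often it occurs in the document."""
    return dict(Counter(tokens))


X_train = [count_features(tokens) for tokens, _ in train_docs]
y_train = [label for _, label in train_docs]

# DictVectorizer turns the {word: count} dicts into a numeric matrix;
# MultinomialNB then uses those counts as quantities.
model = make_pipeline(DictVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

prediction = model.predict([count_features(["waste", "waste", "waste"])])[0]
print(prediction)
```

With features of this shape, a word that occurs three times contributes more evidence than a word that occurs once, which is the behaviour the question asks for.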
