
Text classification with the nltk NaiveBayes classifier

In the code below, I know my Naive Bayes classifier works correctly, because it works on trainset1, so why does it not work on trainset2? I have even tried two classifiers, one from TextBlob and the other from nltk.

from textblob.classifiers import NaiveBayesClassifier 
from textblob import TextBlob 
from nltk.tokenize import word_tokenize 
import nltk 

trainset1 = [('I love this sandwich.', 'pos'), 
('This is an amazing place!', 'pos'), 
('I feel very good about these beers.', 'pos'), 
('This is my best work.', 'pos'), 
("What an awesome view", 'pos'), 
('I do not like this restaurant', 'neg'), 
('I am tired of this stuff.', 'neg'), 
("I can't deal with this", 'neg'), 
('He is my sworn enemy!', 'neg'), 
('My boss is horrible.', 'neg')] 

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'), 
     ('hello i was there and no one came', 'class2'), 
     ('all negative terms like sad angry etc', 'class2')] 

def nltk_naivebayes(trainset, test_sentence): 
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0])) 
    t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in trainset] 
    classifier = nltk.NaiveBayesClassifier.train(t) 
    test_sent_features = {word.lower(): (word in word_tokenize(test_sentence.lower())) for word in all_words} 
    return classifier.classify(test_sent_features) 

def textblob_naivebayes(trainset, test_sentence): 
    cl = NaiveBayesClassifier(trainset) 
    blob = TextBlob(test_sentence,classifier=cl) 
    return blob.classify() 

test_sentence1 = "he is my horrible enemy" 
test_sentence2 = "inflation soaring limps to anniversary" 

print(nltk_naivebayes(trainset1, test_sentence1))
print(nltk_naivebayes(trainset2, test_sentence2))
print(textblob_naivebayes(trainset1, test_sentence1))
print(textblob_naivebayes(trainset2, test_sentence2))

Output:

neg 
class2 
neg 
class2 

Even though test_sentence2 clearly belongs to class1.

Answer


I will assume you understand that you cannot expect a classifier to learn a good model from only 3 examples, and that your question is really about understanding why it fails in this specific case.

The likely reason is that the Naive Bayes classifier uses a prior class probability, that is, the probability of one class versus the other regardless of the text. In your case, 2/3 of the examples belong to class2, so the prior is 66% for class2 and 33% for class1. The words in your single class1 instance are 'anniversary' and 'soaring', and they are unlikely to be enough to compensate for this prior probability.
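To see how lopsided those priors are, here is a minimal sketch in plain Python (not nltk's internal implementation) that computes the maximum-likelihood class priors for the labels in trainset2; `class_priors` is a hypothetical helper, not part of the question's code:

```python
from collections import Counter

def class_priors(labels):
    # Maximum-likelihood prior: fraction of training examples per class.
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# The labels of the three trainset2 examples.
labels = ['class1', 'class2', 'class2']
priors = class_priors(labels)
# class2 starts out with twice the prior weight of class1,
# before a single word of the test sentence is even considered.
print(priors)
```

So before looking at any evidence from the text, the classifier already leans 2-to-1 toward class2.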

Note in particular that the calculation of the word probabilities involves various 'smoothing' functions (for example, log10(Term Frequency + 1) rather than log10(Term Frequency), to prevent low-frequency words from zeroing out the classification result, to avoid division by zero, and so on). That is why the probabilities of 'anniversary' and 'soaring' are not 0.0 for class2 and 1.0 for class1, as you might have expected.
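As an illustration of why those probabilities are never exactly 0.0 or 1.0, here is a sketch of add-one (Laplace) smoothing, a common variant of such smoothing functions; the formula and the toy counts below are assumptions for illustration, not the exact scheme nltk or TextBlob uses internally:

```python
def smoothed_word_prob(word_count_in_class, total_words_in_class, vocab_size):
    # Add-one (Laplace) smoothing: every word receives a pseudo-count of 1,
    # so even a word never seen in a class gets a small non-zero probability.
    return (word_count_in_class + 1) / (total_words_in_class + vocab_size)

# 'anniversary' never occurs in the class2 examples, yet its smoothed
# probability for class2 is small but non-zero (toy counts assumed):
p_unseen = smoothed_word_prob(0, 12, 40)  # word absent from class2
p_seen = smoothed_word_prob(2, 13, 40)    # word seen twice in class1
print(p_unseen, p_seen)
```

Because p_unseen is greater than zero, a couple of class1-only words cannot single-handedly drive the class2 score to zero, and the 2:1 prior keeps winning.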