Classification example using a Multinomial Naive Bayes classifier in Python

I am looking for a simple example of how to run a Multinomial Naive Bayes classifier. I came across this example on StackOverflow:

Implementing Bag-of-Words Naive-Bayes classifier in NLTK

import numpy as np 
from nltk.probability import FreqDist 
from nltk.classify import SklearnClassifier 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_selection import SelectKBest, chi2 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.pipeline import Pipeline 

pipeline = Pipeline([('tfidf', TfidfTransformer()), 
        ('chi2', SelectKBest(chi2, k=1000)), 
        ('nb', MultinomialNB())]) 
classif = SklearnClassifier(pipeline) 

from nltk.corpus import movie_reviews 
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')] 
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')] 
add_label = lambda lst, lab: [(x, lab) for x in lst] 
#Original code from thread: 
#classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg')) 
classif.train(add_label(pos, 'pos') + add_label(neg, 'neg'))#Made changes here 

#Original code from thread:  
#l_pos = np.array(classif.batch_classify(pos[100:])) 
#l_neg = np.array(classif.batch_classify(neg[100:])) 
l_pos = np.array(classif.batch_classify(pos))#Made changes here 
l_neg = np.array(classif.batch_classify(neg))#Made changes here 
print "Confusion matrix:\n%d\t%d\n%d\t%d" % (
      (l_pos == 'pos').sum(), (l_pos == 'neg').sum(), 
      (l_neg == 'pos').sum(), (l_neg == 'neg').sum())

I got a warning after running this example.

C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_selection\univariate_selection.py:327: 
UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, 
or you used a classification score for a regression task. 
warn("Duplicate scores. Result may depend on feature ordering." 

Confusion matrix: 
876 124 
63 937

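So, my questions are:

Can anyone tell me what this warning message means?
I made some changes to the original code, but why are the confusion matrix results so much higher than those in the original thread?
How can I test the accuracy of this classifier?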


2013-07-04 Cryssie

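The original code trains on the first 100 positive and the first 100 negative examples and then classifies the remainder. You have removed that boundary and used every example in both the training and the classification phase; in other words, you have duplicated features. To fix this, split the data set into two sets, one for training and one for testing.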
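The split described above could be sketched roughly as follows. This is only an illustration: `split_train_test` is a hypothetical helper (it is not part of NLTK or scikit-learn), the placeholder strings stand in for the `FreqDist` objects built in the question, and the 25% test fraction is an arbitrary choice.

```python
import random

def split_train_test(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists.

    Hypothetical helper for illustration; not an NLTK or scikit-learn function.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholders standing in for the lists of FreqDist objects from the question.
pos = ["pos_doc_%d" % i for i in range(1000)]
neg = ["neg_doc_%d" % i for i in range(1000)]

pos_train, pos_test = split_train_test(pos)
neg_train, neg_test = split_train_test(neg)

# No document ends up on both sides of the split.
assert not set(pos_train) & set(pos_test)
assert not set(neg_train) & set(neg_test)

print("%d train / %d test per class" % (len(pos_train), len(pos_test)))
# prints: 750 train / 250 test per class
```

With the real data, the classifier from the question would then be trained on `add_label(pos_train, 'pos') + add_label(neg_train, 'neg')` and asked to classify only `pos_test` and `neg_test`.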

The confusion matrix is higher (or different) because you are training on different data.

A confusion matrix is a measure of accuracy: it shows the number of false positives and so on. Read more here: http://en.wikipedia.org/wiki/Confusion_matrix
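As a rough sketch of how a single accuracy figure can be read off such a matrix, the snippet below uses the four counts printed in the question. The `summarize` function and its dictionary keys are made up for this illustration, and the row/column layout is assumed to match the question's `print` statement (rows are the actual class, columns the predicted class).

```python
def summarize(confusion):
    """Return accuracy and per-class recall from a 2x2 confusion matrix.

    confusion[i][j] is the number of documents of actual class i that were
    predicted as class j, with index 0 = 'pos' and index 1 = 'neg'.
    Hypothetical helper for illustration only.
    """
    (tp, fn), (fp, tn) = confusion
    total = tp + fn + fp + tn
    return {
        "accuracy": float(tp + tn) / total,    # share of all documents classified correctly
        "pos_recall": float(tp) / (tp + fn),   # share of actual 'pos' documents found
        "neg_recall": float(tn) / (fp + tn),   # share of actual 'neg' documents found
    }

# Counts printed in the question (first row: actual pos, second row: actual neg).
stats = summarize([[876, 124], [63, 937]])
print("accuracy: %.4f" % stats["accuracy"])  # prints: accuracy: 0.9065
```

Because those particular counts came from classifying the same documents the classifier was trained on, the resulting figure is likely optimistic; the same calculation applied to counts from held-out documents gives a more honest estimate.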


2013-07-05 21:34:38 Spaceghost

If this helped, please [accept the answer](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work/5235) – Spaceghost

I used the original code, with only the first 100 entries as the training set, and still get this warning. My output is:

In [6]: %run testclassifier.py 
C:\Users\..\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\feature_selection\univariate_selection.py:319: 
UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, 
or you used a classification score for a regression task. 
    warn("Duplicate scores. Result may depend on feature ordering." 
Confusion matrix: 
427  473 
132  768


2013-11-11 19:56:54
