2013-07-04 78 views
5

Multinomial Naive Bayes classifier example using Python

I am looking for a simple example of how to run a Multinomial Naive Bayes classifier. I came across this example on StackOverflow:

Implementing Bag-of-Words Naive-Bayes classifier in NLTK

import numpy as np 
from nltk.probability import FreqDist 
from nltk.classify import SklearnClassifier 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_selection import SelectKBest, chi2 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.pipeline import Pipeline 

pipeline = Pipeline([('tfidf', TfidfTransformer()), 
        ('chi2', SelectKBest(chi2, k=1000)), 
        ('nb', MultinomialNB())]) 
classif = SklearnClassifier(pipeline) 

from nltk.corpus import movie_reviews 
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')] 
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')] 
add_label = lambda lst, lab: [(x, lab) for x in lst] 
#Original code from thread: 
#classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg')) 
classif.train(add_label(pos, 'pos') + add_label(neg, 'neg'))#Made changes here 

#Original code from thread:  
#l_pos = np.array(classif.batch_classify(pos[100:])) 
#l_neg = np.array(classif.batch_classify(neg[100:])) 
l_pos = np.array(classif.batch_classify(pos))#Made changes here 
l_neg = np.array(classif.batch_classify(neg))#Made changes here 
print "Confusion matrix:\n%d\t%d\n%d\t%d" % (
      (l_pos == 'pos').sum(), (l_pos == 'neg').sum(), 
      (l_neg == 'pos').sum(), (l_neg == 'neg').sum()) 

After running this example I received a warning.

C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_selection\univariate_selection.py:327: 
UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, 
or you used a classification score for a regression task. 
warn("Duplicate scores. Result may depend on feature ordering." 

Confusion matrix: 
876 124 
63 937 

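So, my questions are:

  1. Can anyone tell me what this warning message means?
  2. I made some changes to the original code, but why are the numbers in the confusion matrix so much higher than those in the original thread?
  3. How can I test the accuracy of this classifier?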

Answers

2

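The original code trains on the first 100 positive and the first 100 negative examples and then classifies the rest. You have removed that boundary and used every example in both the training and the classification phase; in other words, you have duplicate features. To fix this, split the data set into two parts, one for training and one for testing.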
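The split described above can be sketched as follows. This is only a minimal illustration: `holdout_split` is a hypothetical helper name, and the small dictionaries are toy stand-ins for the `FreqDist` objects built from `movie_reviews` in the question.

```python
def holdout_split(pos, neg, n_train):
    """Return (train, test_pos, test_neg) so that no example is in both parts."""
    # Same labelling helper as in the question's code.
    add_label = lambda lst, lab: [(x, lab) for x in lst]
    # The first n_train examples of each class are used for training ...
    train = add_label(pos[:n_train], 'pos') + add_label(neg[:n_train], 'neg')
    # ... and the remaining examples of each class are held out for testing.
    return train, pos[n_train:], neg[n_train:]


# Toy stand-ins for the per-review FreqDist objects (hypothetical data).
pos = [{'great': 2}, {'fun': 1}, {'superb': 1}, {'good': 3}]
neg = [{'dull': 1}, {'bad': 2}, {'awful': 1}, {'boring': 1}]

train, test_pos, test_neg = holdout_split(pos, neg, n_train=3)
print(len(train), len(test_pos), len(test_neg))
```

With the question's code, `train` would be passed to `classif.train(...)`, and `test_pos` and `test_neg` would be passed to `classif.batch_classify(...)`, as the original thread does with `pos[100:]` and `neg[100:]`.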

The numbers in the confusion matrix are higher (or at least different) because you are training on different data.

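The confusion matrix is itself a measure of accuracy: it shows the number of false positives, false negatives, and so on. Read more here: http://en.wikipedia.org/wiki/Confusion_matrix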
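As a rough sketch of how a single accuracy figure can be read off such a matrix, the snippet below uses the four counts printed in the question (876, 124, 63, 937). It assumes the layout of the question's print statement: rows are the true class (pos, then neg) and columns are the predicted class (pos, then neg). `confusion_metrics` is a hypothetical helper name.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision and recall for the 'pos' class."""
    total = tp + fn + fp + tn
    accuracy = float(tp + tn) / total   # share of all reviews labelled correctly
    precision = float(tp) / (tp + fp)   # share of predicted 'pos' that are truly 'pos'
    recall = float(tp) / (tp + fn)      # share of true 'pos' that were found
    return accuracy, precision, recall


# Counts from the question's output: true pos -> (876, 124), true neg -> (63, 937).
accuracy, precision, recall = confusion_metrics(tp=876, fn=124, fp=63, tn=937)
print("accuracy=%.4f precision=%.4f recall=%.4f" % (accuracy, precision, recall))
# accuracy=0.9065 precision=0.9329 recall=0.8760
```

These particular counts come from classifying the same reviews the classifier was trained on, so they are likely to overstate how well it does on unseen data.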

+0

If this helped, please [accept the answer](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work/5235) – Spaceghost

1

I used the original code, which trains on only the first 100 entries, and still got this warning. My output was:

In [6]: %run testclassifier.py 
C:\Users\..\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\feature_selection\univariate_selection.py:319: UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, or you used a classification score for a regression task. 
    warn("Duplicate scores. Result may depend on feature ordering." 
Confusion matrix: 
427  473 
132  768 