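sklearn MultinomialNB: how to find the most discriminating words in each class

2016-02-12

I am using sklearn's multinomial naive Bayes classifier to classify the 20 Newsgroups data. The code is as follows: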

import numpy as np 
import operator 
from sklearn import datasets, naive_bayes, metrics, feature_extraction 

data_train = datasets.fetch_20newsgroups(subset = 'train', shuffle = True, random_state = 2016, remove = ('headers', 'footers', 'quotes')) 
data_test = datasets.fetch_20newsgroups(subset = 'test', shuffle = True, random_state = 2016, remove = ('headers', 'footers', 'quotes')) 
categories = data_train.target_names 

target_map = {} 

# Collapse the 20 newsgroups into 6 coarse classes based on their name prefix.
for i in range(len(categories)):
    if 'comp.' in categories[i]:
        target_map[i] = 0
    elif 'rec.' in categories[i]:
        target_map[i] = 1
    elif 'sci.' in categories[i]:
        target_map[i] = 2
    elif 'misc.forsale' in categories[i]:
        target_map[i] = 3
    elif 'talk.politics' in categories[i]:
        target_map[i] = 4
    else:
        target_map[i] = 5

y_temp = data_train.target 
y_train = [] 

for y in y_temp: 
    y_train.append(target_map[y]) 

y_temp = data_test.target 
y_test = [] 

for y in y_temp: 
    y_test.append(target_map[y]) 

count_vectorizer = feature_extraction.text.CountVectorizer(min_df = 0.01, max_df = 0.5, stop_words = 'english') 
x_train = count_vectorizer.fit_transform(data_train.data) 
x_test = count_vectorizer.transform(data_test.data) 

# scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = count_vectorizer.get_feature_names_out()

mnb_alpha_001 = naive_bayes.MultinomialNB(alpha = 0.01) 

mnb_alpha_001.fit(x_train, y_train) 

y_pred_001 = mnb_alpha_001.predict(x_test) 

print('Accuracy Of MNB With Alpha = 0.01 : ', metrics.accuracy_score(y_test,y_pred_001)) 

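The code above works and performs the classification. In addition, I would like to list, for each class (class 0 to class 5), the 10 most discriminating words that separate that class from the others.

If I had only 2 classes (class 0 and class 1), I could compare the log probabilities using feature_log_prob_ as follows: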

diff = mnb_alpha_001.feature_log_prob_[1,:] - mnb_alpha_001.feature_log_prob_[0,:] 
name_diff = {} 
for i in range(len(feature_names)): 
    name_diff[feature_names[i]] = diff[i] 
names_diff_sorted = sorted(name_diff.items(), key = operator.itemgetter(1), reverse = True) 
for i in range(10): 
    print(names_diff_sorted[i]) 
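
For reference, the quantity being ranked here is a per-word log-likelihood ratio, log P(w | class 1) - log P(w | class 0). A minimal sketch of the same ranking using `np.argsort`, on a made-up 2 x 4 matrix standing in for `feature_log_prob_` (the words, probabilities, and the helper name `top_words_two_class` are illustrative assumptions, not output from the 20 Newsgroups model):

```python
import numpy as np

# Toy stand-in for mnb.feature_log_prob_: 2 classes, 4 words (made-up values).
feature_names = ['car', 'disk', 'engine', 'windows']
log_prob = np.log(np.array([
    [0.10, 0.40, 0.20, 0.30],   # class 0
    [0.40, 0.10, 0.30, 0.20],   # class 1
]))

def top_words_two_class(log_prob, feature_names, k=10):
    """Rank words by log P(w | class 1) - log P(w | class 0), largest first."""
    diff = log_prob[1, :] - log_prob[0, :]
    order = np.argsort(diff)[::-1][:k]   # indices of the k largest differences
    return [(str(feature_names[i]), float(diff[i])) for i in order]

top = top_words_two_class(log_prob, feature_names, k=2)
print(top)
```

A large positive score means the word is much more probable under class 1 than under class 0; a score near zero means the word does not help separate the two.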

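That subtraction lists the 10 words that most strongly distinguish class 1 from class 0. The problem is that with more than two classes I cannot simply subtract one class's log probabilities from another's.

I would appreciate expert advice on how to do this, so that I can get the 10 most discriminating words for each class.

Many thanks.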

Answers

import operator

# Reuses np, naive_bayes, metrics, x_train, y_train, x_test, y_test and
# count_vectorizer from the question.

# Try several smoothing values and record the test accuracy of each.
acc = []
rr = [0.001, 0.01, 0.1, 1, 10]
for alp in range(len(rr)):
    mnb = naive_bayes.MultinomialNB(alpha = rr[alp])
    mnb.fit(x_train, y_train)
    y_pred = mnb.predict(x_test)
    acc.append(metrics.accuracy_score(y_test, y_pred))
    print('accuracy of Multinomial Naive Bayes for alpha ', rr[alp], '=', acc[alp])

# Refit once with the alpha that gave the highest accuracy.
pos, m = max(enumerate(acc), key = operator.itemgetter(1))
print("Max accuracy=", m, " for alpha=", rr[pos])
mnb = naive_bayes.MultinomialNB(alpha = rr[pos])
mnb.fit(x_train, y_train)

feature_names = count_vectorizer.get_feature_names_out()

for ss in range(mnb.feature_log_prob_.shape[0]):
    # For every word, subtract the highest log probability among the other classes.
    others = np.delete(mnb.feature_log_prob_, ss, axis = 0)
    diff = mnb.feature_log_prob_[ss, :] - np.max(others, axis = 0)
    name_diff = dict(zip(feature_names, diff))
    names_diff_sorted = sorted(name_diff.items(), key = operator.itemgetter(1), reverse = True)
    for i in range(10):
        print(ss, names_diff_sorted[i])
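
The comparison against the strongest competing class can also be tried in isolation, without downloading the dataset. The following is a minimal sketch under stated assumptions: the function name `top_discriminating_words`, the four words, and the probabilities are made up for illustration, and the matrix only stands in for a fitted model's `feature_log_prob_`.

```python
import numpy as np

def top_discriminating_words(log_prob, feature_names, k=10):
    """For each class, score every word by log P(w | class) minus the largest
    log P(w | other class), and return the k highest-scoring words."""
    result = {}
    for c in range(log_prob.shape[0]):
        others = np.delete(log_prob, c, axis=0)      # every row except class c
        diff = log_prob[c, :] - others.max(axis=0)   # margin over the best rival class
        order = np.argsort(diff)[::-1][:k]           # indices of the k largest margins
        result[c] = [(str(feature_names[i]), float(diff[i])) for i in order]
    return result

# Toy stand-in for feature_log_prob_: 3 classes, 4 words (made-up values).
feature_names = ['car', 'disk', 'orbit', 'sale']
log_prob = np.log(np.array([
    [0.10, 0.60, 0.10, 0.20],   # class 0
    [0.60, 0.10, 0.10, 0.20],   # class 1
    [0.10, 0.10, 0.60, 0.20],   # class 2
]))

top = top_discriminating_words(log_prob, feature_names, k=1)
for c, words in top.items():
    print(c, words)
```

With a fitted model, the call would presumably be `top_discriminating_words(mnb.feature_log_prob_, feature_names)`. A word scores high only if it is more probable in the class than in every other class, so words shared by two similar classes are penalised.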

Could you explain your answer in more detail? – Miguel
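
One possible variation, offered as a sketch rather than as what the answer does: instead of comparing each class with its single strongest rival, pool the word counts of all other classes into one "rest" class and take the log ratio against that. The helper name `one_vs_rest_log_ratio` and the toy counts are assumptions for illustration; the smoothing follows the additive form log((count + alpha) / (class total + alpha * n_features)) that MultinomialNB uses for `feature_log_prob_`.

```python
import numpy as np

def one_vs_rest_log_ratio(feature_count, alpha=1.0):
    """Return an (n_classes, n_features) array holding
    log P(w | class) - log P(w | all other classes pooled),
    with additive smoothing applied to both estimates."""
    feature_count = np.asarray(feature_count, dtype=float)
    total = feature_count.sum(axis=0)            # word counts pooled over all classes
    ratios = np.empty_like(feature_count)
    for c in range(feature_count.shape[0]):
        own = feature_count[c] + alpha                 # smoothed counts for class c
        rest = (total - feature_count[c]) + alpha      # smoothed counts for the pooled rest
        own_log_prob = np.log(own) - np.log(own.sum())
        rest_log_prob = np.log(rest) - np.log(rest.sum())
        ratios[c] = own_log_prob - rest_log_prob
    return ratios

# Toy per-class word counts: 3 classes, 3 words (made-up values).
counts = np.array([
    [8, 1, 1],
    [1, 8, 1],
    [1, 1, 8],
])
ratios = one_vs_rest_log_ratio(counts, alpha=1.0)
print(ratios)
```

On a fitted model the inputs would presumably be `mnb.feature_count_` and `mnb.alpha`, and the top words per class are the largest entries in each row of the result, for example via `np.argsort(ratios[c])[::-1][:10]`.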
