How to get the most informative features for scikit-learn classifiers?

The classifiers in machine learning packages such as liblinear and NLTK offer a method show_most_informative_features(), which is really helpful for debugging features:

viagra = None   ok : spam  =  4.5 : 1.0 
hello = True   ok : spam  =  4.5 : 1.0 
hello = None   spam : ok  =  3.3 : 1.0 
viagra = True   spam : ok  =  3.3 : 1.0 
casino = True   spam : ok  =  2.0 : 1.0 
casino = None   ok : spam  =  1.5 : 1.0

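My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but could not find anything like it.

If there is no such function yet, does anybody know a workaround to get at those values?

Thanks a lot!

Source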

2012-06-20 tobigue

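Do you mean the most discriminating parameters? – Simon

I'm not sure what you mean by parameters. I mean the most discriminating features: in a bag-of-words model for spam classification, for example, which words give the most evidence for each class. Not the parameters that I understand as "settings" of the classifier, such as the learning rate. – tobigue

@eowl: In machine learning parlance, *parameters* are the settings produced by the learning procedure based on the *features* of your training set. Learning rate and the like are *hyperparameters*. –

With the help of larsmans' code I came up with this code for the binary case: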

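def show_most_informative_features(vectorizer, clf, n=20):
    # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    feature_names = vectorizer.get_feature_names_out()
    # Sort (coefficient, feature name) pairs from most negative to most positive
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Pair the n most negative with the n most positive coefficients
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))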

Source

2012-06-21 14:55:49 tobigue

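Thanks, exactly what I needed! – WeaselFox

How do I call the function from a main method? What do f1 and f2 stand for? I am trying to call the function with a decision tree classifier from scikit-learn. – 2014-03-30 20:37:35

This code only works with linear classifiers that have a 'coef_' array, so unfortunately I don't think it can be used with sklearn's decision tree classifiers. 'fn_1' and 'fn_2' stand for the feature names. – tobigue

The classifiers themselves do not record feature names; they only see numeric arrays. However, if you extracted your features with a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):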

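import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        # argsort is ascending, so the last 10 indices have the largest coefficients
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))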

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.

Source

2012-06-20 09:51:55

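Yes, in my case I only have two classes, but with your code I was able to come up with what I wanted. Thanks a lot! – tobigue

@eowl: You're welcome. Did you take the 'np.abs' of 'coef_'? Because taking the highest-valued coefficients will only return the features that are indicative of the positive class. –

Something like that... I sorted the list and took the head and the tail, which still lets you see which class each feature votes for. I posted my solution [below](http://stackoverflow.com/a/11140887/979377). – tobigue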
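The np.abs idea from the comments above can be sketched roughly as follows: rank by magnitude, then use the sign to recover which class a feature votes for. The coefficient values, feature names and label order here are made up for illustration and do not come from a fitted model.

```python
import numpy as np

# Hypothetical coefficients and feature names, made up for illustration.
feature_names = np.array(["casino", "hello", "meeting", "offer", "viagra"])
coef = np.array([0.8, -1.5, -0.4, 0.3, 1.2])
class_labels = ["ok", "spam"]  # assumed order: negative class, positive class

# Rank by absolute value, largest magnitude first
order = np.argsort(np.abs(coef))[::-1]
ranking = [
    # The sign of the coefficient selects the class the feature votes for
    (str(feature_names[j]), class_labels[int(coef[j] > 0)], float(coef[j]))
    for j in order
]
for name, label, weight in ranking:
    print(f"{name:10s} {label:5s} {weight:+.2f}")
```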

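RandomForestClassifier does not have a coef_ attribute yet, but I think it will in the 0.17 release. However, see the RandomForestClassifierWithCoef class in "Recursive feature elimination on Random Forest using scikit-learn". This may give you some ideas for working around the limitation above.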

Source

2015-07-28 18:35:13

You can also do something like this to plot the feature importances:

import matplotlib.pyplot as plt
import numpy as np

# clf is assumed to be an already fitted forest (e.g. RandomForestClassifier)
importances = clf.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
# Feature indices sorted by decreasing importance
indices = np.argsort(importances)[::-1]
n_features = importances.shape[0]

# Print the feature ranking
#print("Feature ranking:")

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(n_features), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(n_features), indices)
plt.xlim([-1, n_features])
plt.show()

Source

2016-08-01 14:55:15 Oleole

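As an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much of the observed variance is explained by that feature. Obviously, the sum of all these values must be <= 1.

I find this attribute very useful when doing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

Edit: this works for both random forests and gradient boosting, so RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support it.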
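As a small illustration of the attribute described above, the sketch below fits a forest on synthetic data and ranks the features by importance. The dataset from make_classification and the feature names f0..f5 are assumptions made for illustration. In the scikit-learn versions I am aware of, the impurity-based importances of a fitted forest are normalized, so they sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = clf.feature_importances_

# Pair each importance with its feature name, most important first
ranking = sorted(zip(importances, feature_names), reverse=True)
for importance, name in ranking:
    print(f"{name}: {importance:.3f}")
print("sum:", importances.sum())
```

Unlike coef_, these values carry no sign, so they indicate how much a feature matters but not which class it favors.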

Source

2016-08-13 07:31:42 ClimbsRocks

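We recently released a library (https://github.com/TeamHG-Memex/eli5) that can do this: it handles various classifiers from scikit-learn, covers binary and multiclass cases, can highlight text according to feature values, integrates with IPython, and so on.

Source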

2016-11-24 17:42:54
