scikit feature importance selection experiences

Scikit-learn has a mechanism to rank features (for classification) using extremely randomized trees.

forest = ExtraTreesClassifier(n_estimators=250, 
          compute_importances=True, 
          random_state=0)

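My question is whether this method performs a "univariate" or a "multivariate" feature ranking. The univariate case is the one where individual features are compared against each other. I would appreciate some clarification here. Are there any other parameters I should try to fiddle with? Any experiences and pitfalls with this ranking method are also appreciated. The output of this ranking identifies feature numbers (5, 20, 7). I would like to check whether a feature number really corresponds to a row in the feature matrix, that is, whether feature number 5 corresponds to the sixth row of the feature matrix (counting from 0).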

Source

2012-10-19 user963386

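Can you restate explicitly what your question is? You are giving a bunch of approximate assertions and it is hard to guess what your real question is. Also, in scikit-learn the data is shaped as `(n_samples, n_features)`, so feature indices refer to the columns of the data matrix, not the rows. – ogrisel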
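To make the row/column point concrete, here is a minimal sketch. The synthetic dataset from `make_classification` and the variable names are assumptions chosen only for illustration; they are not from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data, for illustration only: 200 samples (rows), 10 features (columns).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

forest = ExtraTreesClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# One importance value per column of X.
importances = forest.feature_importances_

# Column indices, most important first.
ranking = np.argsort(importances)[::-1]

# A feature number from the ranking selects a column, not a row.
top = ranking[0]
top_feature_values = X[:, top]

print("X shape:", X.shape)
print("importances shape:", importances.shape)
print("top feature is column", top)
```

So a reported feature number 5 would be `X[:, 5]`, the sixth column counting from 0.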

To answer the first question: multivariate.

Sorry, my question was confusing. I am still learning this field, and I agree the question is unclear. Anyway, thanks for the clarification. – user963386

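I'm not an expert, but this is not univariate. In fact, the total importance is computed from the feature importances of each tree (by taking the mean, I think).

For each tree, the importance is computed from the impurity of the split.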
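The averaging described above can be checked directly. This is a sketch assuming the behaviour of the scikit-learn versions I am aware of, where each fitted tree in `forest.estimators_` exposes its own normalized `feature_importances_`; the synthetic dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

forest = ExtraTreesClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances of every individual tree, one row per tree.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Mean over the trees, one value per feature.
averaged = per_tree.mean(axis=0)

# Compare with what the forest itself reports.
matches = np.allclose(averaged, forest.feature_importances_)
print("forest importances equal the per-tree mean:", matches)
```

The spread across trees, `per_tree.std(axis=0)`, gives a rough idea of how stable each feature's importance is.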

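I have used this method and it seems to give good results, better than a univariate method from my point of view. But I don't know of any technique to test the results, other than knowledge of the dataset.

To get the features in the right order, you should follow this example and modify it a bit, like this, so that it uses a pandas.DataFrame with proper column names: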
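One way to see the difference from a univariate method is a toy problem where no feature is informative on its own. The sketch below is an assumption-laden illustration, not part of the original answer: the XOR-style data, the use of `f_classif` as the univariate score, and `max_features=None` are all my choices.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)

# Three features in [-1, 1]; only the first two matter, and only jointly.
X = rng.uniform(-1, 1, size=(1000, 3))

# XOR of the signs of x0 and x1; x2 is pure noise.
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

# Univariate scores: each feature is tested against y in isolation.
f_scores, _ = f_classif(X, y)

# Tree-based scores: splits on x0 and x1 can combine along a path.
forest = ExtraTreesClassifier(n_estimators=100, max_features=None,
                              random_state=0)
forest.fit(X, y)
tree_scores = forest.feature_importances_

for name, f, t in zip(["x0", "x1", "noise"], f_scores, tree_scores):
    print("%s: F-score=%.2f, forest importance=%.2f" % (name, f, t))
```

On data like this, one would expect the univariate F-scores to give little reason to prefer `x0` or `x1` over the noise column, while the forest ranks both above it.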

import numpy as np
import pandas
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesClassifier

X = pandas.DataFrame(...)  # feature matrix with named columns
y = pandas.Series(...)     # target labels

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)

# Rescale so that the most important feature scores 100
feature_importance = forest.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]  # most important first

# Print the ranking using the DataFrame column names
print("Feature importance:")
for i, (f, w) in enumerate(zip(X.columns[sorted_idx], feature_importance[sorted_idx]), start=1):
    print("%d) %s : %d" % (i, f, w))

# Horizontal bar chart of the top features
nb_to_display = 30
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos[:nb_to_display], feature_importance[sorted_idx][:nb_to_display], align='center')
plt.yticks(pos[:nb_to_display], X.columns[sorted_idx][:nb_to_display])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

Source

2014-02-02 15:51:31 Mermoz
