Python: how to get the real feature names from feature_importances_

I am using sklearn's random forest (ensemble.RandomForestClassifier) in Python for classification, and I use feature_importances_ to find the classifier's important features. My code so far is:

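from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) looks like Counter({'school': 1, 'hospital': 1, 'bus station': 2}); each key is a feature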

feat_loc_vectorizer = DictVectorizer() 
feat_loc_vectorizer.fit(venue_feature_start) 
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start) 

orig_tfidf = TfidfTransformer() 
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr()) 

# DictVectorizer() and TfidfTransformer() turn the venue counts into feature vectors; each instance has 580 dimensions, i.e. there are 580 venue types

data = orig_ven_feat.tocsr() 

le = LabelEncoder() 
labels = le.fit_transform(labels_raw) 
if "Unlabelled" in labels_raw: 
    unlabelled_int = int(le.transform(["Unlabelled"])[0])  # take the single element explicitly
else: 
    unlabelled_int = -1 

valid_rows_idx = np.where(labels!=unlabelled_int)[0] 
labels = labels[valid_rows_idx] 
user_ids = np.asarray(user_ids_raw) 
# user_ids is for cross validation, labels is for classification 

clf = ensemble.RandomForestClassifier(n_estimators = 50) 
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)      
data = data[valid_rows_idx,:].toarray() 
for train_ind, test_ind in cv_indices: 
    train_data = data[train_ind,:] 
    test_data = data[test_ind,:] 
    labels_train = labels[train_ind] 
    labels_test = labels[test_ind] 

    print ("Training classifier...") 
    clf.fit(train_data,labels_train) 
    importances = clf.feature_importances_

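The problem now is that feature_importances_ gives me an array of size 580 (the same as the feature dimension), and I want to know the top 20 most important features (the top 20 most important venues).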

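I think I should at least find the indices of the 20 largest values in importances, but I don't know:

how to get the indices of the top 20 values in importances
how to match those indices to the real venue names ('school', 'home', ...), since I used DictVectorizer and TfidfTransformer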

Any ideas that could help me? Thank you very much!

Source

2015-05-20 gladys0313

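The feature_importances_ attribute returns the relative importance values in the same order in which the features were fed to the algorithm. So, to get the top 20 features, sort the features by importance, for example like this: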

importances = forest.feature_importances_ 
indices = numpy.argsort(importances)[-20:]

([-20:] because argsort sorts in ascending order, so the 20 most important features are the last 20 elements of the array)
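The slice above returns the indices from least to most important. If the most important feature should come first, the slice can be reversed. A minimal sketch, assuming a made-up importance array in place of forest.feature_importances_ and a hypothetical top_k of 3 instead of 20:

```python
import numpy as np

# Toy importance values standing in for forest.feature_importances_ (made up).
importances = np.array([0.05, 0.40, 0.10, 0.25, 0.20])
top_k = 3  # the question would use 20

# argsort is ascending: take the last top_k indices, then reverse them
# so that the most important feature comes first.
top_indices = np.argsort(importances)[-top_k:][::-1]

for idx in top_indices:
    print(idx, importances[idx])
```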

Source

2015-05-20 16:40:04 user2314737

Thank you very much, but do you know how to match the indices with the real feature names? – gladys0313

Haha! Well, that depends on how good your answer is :P – gladys0313

To get the importance of each feature by name, just iterate over the column names and feature_importances_ together (they map to each other):

# df is assumed to be the pandas DataFrame whose columns were used to train clf
for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))
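The question's pipeline has no DataFrame, so there are no column names to iterate over. In that setup the names would probably come from the fitted DictVectorizer: its feature_names_ attribute lists the names in column order, and TfidfTransformer only rescales values without reordering columns, so position i in feature_importances_ should correspond to feature_names_[i]. A minimal sketch under those assumptions, with made-up trips and labels and a hypothetical top_k of 2:

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up venue counts and labels, standing in for the question's data.
venue_counts = [
    Counter({"school": 2, "bus station": 1}),
    Counter({"hospital": 1, "bus station": 2}),
    Counter({"school": 1, "home": 3}),
    Counter({"hospital": 2, "home": 1}),
]
labels = np.array([0, 1, 0, 1])

# Same two steps as in the question: dict counts -> matrix -> tf-idf.
vectorizer = DictVectorizer()
counts = vectorizer.fit_transform(venue_counts)
features = TfidfTransformer().fit_transform(counts)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, labels)

# feature_names_ is in column order, so it lines up with the importances.
feature_names = np.asarray(vectorizer.feature_names_)
top_k = 2  # the question would use 20
top_indices = np.argsort(clf.feature_importances_)[-top_k:][::-1]
top_features = list(
    zip(feature_names[top_indices], clf.feature_importances_[top_indices])
)

for name, importance in top_features:
    print(f"{name}: {importance:.3f}")
```

Applied to the question's code, this would mean reading the names from feat_loc_vectorizer.feature_names_. Filtering rows with valid_rows_idx does not change the columns, so the alignment should still hold.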

Source

2017-12-03 01:52:17
