2015-05-20 205 views

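Python: how to get the real feature names from feature_importances_

I am using scikit-learn's random forest (ensemble.RandomForestClassifier) in Python for classification, and I use feature_importances_ to find the classifier's important features. My code so far is: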

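import numpy as np
from collections import Counter
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

# database, labels_raw, user_ids_raw and LeavePUsersOut are defined elsewhere
venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) looks like Counter({'school': 1, 'hospital': 1, 'bus station': 2}); the keys are the features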

feat_loc_vectorizer = DictVectorizer() 
feat_loc_vectorizer.fit(venue_feature_start) 
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start) 

orig_tfidf = TfidfTransformer() 
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr()) 

# DictVectorizer() and TfidfTransformer() turn the venue counts into features; each instance has 580 dimensions, i.e. there are 580 venue types

data = orig_ven_feat.tocsr() 

le = LabelEncoder() 
labels = le.fit_transform(labels_raw) 
if "Unlabelled" in labels_raw: 
    unlabelled_int = int(le.transform(["Unlabelled"])[0])
else: 
    unlabelled_int = -1 

valid_rows_idx = np.where(labels!=unlabelled_int)[0] 
labels = labels[valid_rows_idx] 
user_ids = np.asarray(user_ids_raw) 
# user_ids is for cross validation, labels is for classification 

clf = ensemble.RandomForestClassifier(n_estimators = 50) 
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)      
data = data[valid_rows_idx,:].toarray() 
for train_ind, test_ind in cv_indices: 
    train_data = data[train_ind,:] 
    test_data = data[test_ind,:] 
    labels_train = labels[train_ind] 
    labels_test = labels[test_ind] 

    print ("Training classifier...") 
    clf.fit(train_data,labels_train) 
    importances = clf.feature_importances_ 

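The problem is that feature_importances_ gives me an array of length 580 (the same as the feature dimension), and I want to know the 20 most important features (the 20 most important venues).

I think I at least need the indices of the 20 largest values in importances, but I don't know:

  1. how to get the indices of the top 20 values in importances;

  2. since I used DictVectorizer and TfidfTransformer, how to match those indices with the real venue names ('school', 'home', ...).

Any ideas? Thank you very much!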

Answers


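The feature_importances_ attribute returns the relative importance values in the same order in which the features were fed to the algorithm. So to get the top 20 features you need to sort them by importance and keep the 20 largest, for example like this: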

importances = forest.feature_importances_ 
indices = numpy.argsort(importances)[-20:] 

([-20:] because argsort sorts in ascending order, so the 20 most important features are the last 20 elements of the array.)
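As a small illustration of the sorting step, here is a sketch that returns the indices most-important-first rather than in ascending order. The importance values are made up and stand in for clf.feature_importances_, and top_n is 3 only to keep the output short (the question would use 20).

```python
import numpy as np

# Hypothetical importance scores standing in for clf.feature_importances_
importances = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.08, 0.02])
top_n = 3  # the question would use 20

# argsort is ascending, so reverse it and keep the first top_n positions
top_indices = np.argsort(importances)[::-1][:top_n]

for rank, idx in enumerate(top_indices, start=1):
    print(rank, idx, importances[idx])
```

Reversing before slicing gives the same set of indices as [-top_n:], only ordered from most to least important.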

Thank you very much, but do you know how to match the indices with the real feature names? – gladys0313

Haha! Well, that depends on how good your answer is :P – gladys0313


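To get the importance of each feature by name, just iterate over the column names and feature_importances_ together (they map to each other one to one; df here is the DataFrame the classifier was trained on):

for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))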
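The question builds its features with DictVectorizer rather than a DataFrame, so there is no df.columns to iterate over. A possible sketch for that case follows; it relies on the feature_names_ attribute of a fitted DictVectorizer, and on TfidfTransformer reweighting values without reordering columns. The trips and labels below are hypothetical toy data, not the asker's dataset, and top_n is 3 instead of 20.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical toy trips standing in for the question's database
trips = [
    ["school", "school", "bus station"],
    ["hospital", "bus station"],
    ["school", "home"],
    ["hospital", "hospital", "home"],
]
labels = [0, 1, 0, 1]

vectorizer = DictVectorizer()
counts = vectorizer.fit_transform([Counter(t) for t in trips])
data = TfidfTransformer().fit_transform(counts).toarray()

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(data, labels)

# Column i of the matrix corresponds to vectorizer.feature_names_[i]
feature_names = np.asarray(vectorizer.feature_names_)
top_n = 3  # the question would use 20
top_indices = np.argsort(clf.feature_importances_)[::-1][:top_n]

for name, score in zip(feature_names[top_indices],
                       clf.feature_importances_[top_indices]):
    print(name, score)
```

With the question's variables, the same lookup would use feat_loc_vectorizer.feature_names_ in place of vectorizer.feature_names_.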