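python: How to get the real feature names from feature_importances

I am using the random forest from Python's sklearn (ensemble.RandomForestClassifier) for classification, and feature_importances_ to find the classifier's important features. Right now, my code is: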
from collections import Counter

import numpy as np
from sklearn import ensemble
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder

# database, labels_raw, user_ids_raw and LeavePUsersOut are defined elsewhere (not shown in this post)
venue_feature_start = []
for trip in database:
    venue_feature_start.append(Counter(trip['POI']))
# Counter(trip['POI']) looks like Counter({'school': 1, 'hospital': 1, 'bus station': 2}); each key is a feature

feat_loc_vectorizer = DictVectorizer()
feat_loc_vectorizer.fit(venue_feature_start)
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)
orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())
# DictVectorizer() and TfidfTransformer() build the feature matrix; each instance has 580 dimensions, i.e. there are 580 venue types

data = orig_ven_feat.tocsr()
le = LabelEncoder()
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    # transform() returns an array, so take its single element
    unlabelled_int = int(le.transform(["Unlabelled"])[0])
else:
    unlabelled_int = -1
valid_rows_idx = np.where(labels != unlabelled_int)[0]
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is used for cross-validation, labels for classification

clf = ensemble.RandomForestClassifier(n_estimators=50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds=10)
data = data[valid_rows_idx, :].toarray()
for train_ind, test_ind in cv_indices:
    train_data = data[train_ind, :]
    test_data = data[test_ind, :]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]
    print("Training classifier...")
    clf.fit(train_data, labels_train)
    # Note: importances is overwritten on every fold
    importances = clf.feature_importances_
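Now the problem is that feature_importances_ gives me an array of size 580 (the same as the feature dimension), and I want to know the top 20 important features (the top 20 important venues).
I think the least I need is the indices of the 20 largest values in importances, but I don't know:
How to get the indices of the top 20 values from importances.
Since I use DictVectorizer and TfidfTransformer, how to match those indices with the real venue names ('school', 'home', ...).
Any ideas to help me? Thank you very much!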
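For the first point, one possible approach is to sort the importances with NumPy and keep the last 20 positions. The sketch below is only an illustration: the helper name `top_k_indices` and the small example array are made up here, and in the code above you would pass `clf.feature_importances_` instead.

```python
import numpy as np


def top_k_indices(importances, k=20):
    """Return the indices of the k largest importances, most important first."""
    importances = np.asarray(importances)
    k = min(k, importances.shape[0])
    # argsort sorts ascending, so reverse it and keep the first k entries
    return np.argsort(importances)[::-1][:k]


# Made-up values standing in for clf.feature_importances_
example_importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
top3 = top_k_indices(example_importances, k=3)
print(top3)  # [1 3 4]
```

With the real classifier this would be something like `top_k_indices(clf.feature_importances_, k=20)`, which returns column indices into the 580-dimensional feature matrix.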
Thank you very much, but do you know how to match the indices with the real feature names? – gladys0313
Haha! Well, that depends on how good your answer is :P – gladys0313
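On matching indices to names: a fitted DictVectorizer keeps its column names in the `feature_names_` attribute, and TfidfTransformer only reweights the columns without reordering them, so column i of the TF-IDF matrix should still correspond to `feature_names_[i]`. A minimal sketch under those assumptions follows; the toy data and the importance values are invented for illustration and stand in for `venue_feature_start` and `clf.feature_importances_`.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Toy stand-in for venue_feature_start (made-up data)
toy_counts = [
    {"school": 1, "hospital": 1, "bus station": 2},
    {"home": 3, "school": 1},
]

vec = DictVectorizer()
counts = vec.fit_transform(toy_counts)
tfidf = TfidfTransformer().fit_transform(counts)

# Column i of the matrix corresponds to vec.feature_names_[i]
names = np.asarray(vec.feature_names_)

# Made-up values standing in for clf.feature_importances_
toy_importances = np.array([0.1, 0.2, 0.3, 0.4])

# Indices sorted from most to least important, then paired with their names
order = np.argsort(toy_importances)[::-1]
ranked = [(str(names[i]), float(toy_importances[i])) for i in order]

for name, score in ranked:
    print(name, score)
```

In the question's code the equivalent lookup would use `feat_loc_vectorizer.feature_names_`, sliced to the first 20 entries of the sorted order.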