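Interpreting Scikit-Learn model output: different measures for an Extra Trees classifier

I have a dataset and am developing a predictive model with an Extra Trees classifier, as shown in the code below. The et_score values from the initial code were quite disappointing. The further runs shown below looked better, but then I plotted a validation curve and things looked not so hot. All in all, very confusing. Initial code: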
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

# split the dataset for train and test (roughly 75% / 25%)
combnum['is_train'] = np.random.uniform(0, 1, len(combnum)) <= .75
train, test = combnum[combnum['is_train']], combnum[~combnum['is_train']]

et = ExtraTreesClassifier(n_estimators=200, max_depth=None, min_samples_split=10, random_state=0)

# label and feature arrays for the train and test partitions
labels = train[list(label_columns)].values
tlabels = test[list(label_columns)].values
features = train[list(columns)].values
tfeatures = test[list(columns)].values

# cv=3 keeps the three folds that older scikit-learn versions used by default
et_score = cross_val_score(et, features, labels.ravel(), cv=3, n_jobs=-1)
print("{0} -> ET: {1})".format(label_columns, et_score))
This gives me:
['Campaign_Response'] -> ET: [ 0.58746427 0.31725003 0.43522521])
Not so hot! Then, on the held-out data:
et.fit(features,labels.ravel())
et.score(tfeatures,tlabels.ravel())
Out[16]:0.7434136771300448
Not so bad. Then, on the training data:
et.score(features,labels.ravel())
Out[17]:0.85246473144769563
Again, not bad, but it seems to bear no relation to the earlier scores? Then I ran:
import numpy as np
import matplotlib.pyplot as plt
# sklearn.learning_curve was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import validation_curve

def plot_validation_curve(estimator, X, y, param_name, param_range,
                          ylim=(0, 1.1), cv=5, n_jobs=-1, scoring=None):
    estimator_name = type(estimator).__name__
    plt.title("Validation curves for %s on %s"
              % (param_name, estimator_name))
    plt.ylim(*ylim); plt.grid()
    plt.xlim(min(param_range), max(param_range))
    plt.xlabel(param_name)
    plt.ylabel("Score")

    # one row of scores per parameter value, one column per CV fold
    train_scores, test_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range,
        cv=cv, n_jobs=n_jobs, scoring=scoring)
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.semilogx(param_range, train_scores_mean, 'o-', color="r",
                 label="Training score")
    plt.semilogx(param_range, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    plt.legend(loc="best")
    # note: this reports the mean CV score for the last value in param_range
    print("Best test score: {:.4f}".format(test_scores_mean[-1]))
called as follows:
clf = ExtraTreesClassifier(max_depth=8)
param_name = 'max_depth'
param_range = [1, 2, 4, 8, 16, 32]
plot_validation_curve(clf, features, labels.ravel(),
                      param_name, param_range, scoring='roc_auc')
This gives me a plot and a legend that do not seem to reflect the earlier information:
Best test score: 0.3592
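One thing that may be worth keeping in mind when comparing these numbers: et.score reports mean accuracy, while the validation curve above was scored with scoring='roc_auc', so the values are not on the same scale. The following is only a minimal sketch of how both measures could be computed on identical folds. It assumes synthetic make_classification data in place of the real features and labels, and the names X, y, clf_demo and cv are hypothetical. The shuffled StratifiedKFold is likewise an assumption, chosen so that any ordering in the rows does not concentrate in one fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real data (assumption, not the original dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

clf_demo = ExtraTreesClassifier(n_estimators=50, max_depth=8, random_state=0)

# Shuffled, stratified folds so every fold sees a similar class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Score the same model on the same folds with both measures.
results = cross_validate(clf_demo, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
mean_accuracy = results["test_accuracy"].mean()
mean_auc = results["test_roc_auc"].mean()

print("Mean CV accuracy: {:.4f}".format(mean_accuracy))
print("Mean CV ROC AUC:  {:.4f}".format(mean_auc))
```

With both measures taken from the same folds, any remaining gap between them reflects the metric rather than the split.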
And finally, the sklearn metrics give me:
Accuracy:0.737
Classification report
             precision    recall  f1-score   support

          0       0.76      0.79      0.78      8311
          1       0.70      0.66      0.68      6134

avg / total       0.74      0.74      0.74     14445
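The code that produced this report is not shown. A report of this shape is typically obtained from sklearn.metrics along the lines of the sketch below. This is an assumption about how the numbers were generated, not the original code: it uses synthetic data, and the names X_train, X_test, y_pred, acc and report are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption, not the original dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = ExtraTreesClassifier(n_estimators=50, min_samples_split=10, random_state=0)
model.fit(X_train, y_train)

# Predict on the held-out rows and summarise per-class performance.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:{:.3f}".format(acc))
print("Classification report")
print(report)
```

The accuracy printed here is the same quantity that model.score(X_test, y_test) returns, so it should agree with a held-out score computed on the same rows.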
It seems to me that I should be able to interpret these results better. Can anyone help?
Very helpful, thanks – dartdog