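I am using GradientBoostingClassifier from sklearn and am getting some strange values in the verbose output. I train on random 10% samples drawn from my whole dataset. Most runs seem fine, but sometimes I get strange output and poor results. Can someone explain what is going on? Is this a bug in scikit-learn's GradientBoostingClassifier?
"Good" result: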
n features = 168
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.01, loss='deviance', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=2000, presort='auto', random_state=None,
subsample=1.0, verbose=1, warm_start=False)
Iter Train Loss Remaining Time
1 0.6427 40.74m
2 0.6373 40.51m
3 0.6322 40.34m
4 0.6275 40.33m
5 0.6230 40.31m
6 0.6187 40.18m
7 0.6146 40.34m
8 0.6108 40.42m
9 0.6071 40.43m
10 0.6035 40.28m
20 0.5743 40.12m
30 0.5531 39.74m
40 0.5367 39.49m
50 0.5237 39.13m
60 0.5130 38.78m
70 0.5041 38.47m
80 0.4963 38.34m
90 0.4898 38.22m
100 0.4839 38.14m
200 0.4510 37.07m
300 0.4357 35.49m
400 0.4270 33.87m
500 0.4212 31.77m
600 0.4158 29.82m
700 0.4108 27.74m
800 0.4065 25.69m
900 0.4025 23.55m
1000 0.3987 21.39m
2000 0.3697 0.00s
predicting
this_file_MCC = 0.5777
"Bad" result:
Training the classifier
n features = 168
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=1.0, loss='deviance', max_depth=5,
max_features='sqrt', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, presort='auto', random_state=None,
subsample=1.0, verbose=1, warm_start=False)
Iter Train Loss Remaining Time
1 0.5542 1.07m
2 0.5299 1.18m
3 0.5016 1.14m
4 0.4934 1.16m
5 0.4864 1.19m
6 0.4756 1.21m
7 0.4699 1.24m
8 0.4656 1.26m
9 0.4619 1.24m
10 0.4572 1.26m
20 0.4244 1.27m
30 0.4063 1.24m
40 0.3856 1.20m
50 0.3711 1.18m
60 0.3578 1.13m
70 0.3407 1.10m
80 0.3264 1.09m
90 0.3155 1.06m
100 0.3436 1.04m
200 0.3516 46.55s
300 1605.5140 29.64s
400 52215150662014.0469 13.70s
500 585408988869401440279216573629431147797247696359586211550088082222979417986203510562624281874357206861232303015821113689812886779519405981626661580487933040706291550387961400555272759265345847455837036753780625546140668331728366820653710052494883825953955918423887242778169872049367771382892462080.0000 0.00s
predicting
this_file_MCC = 0.0398
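Looking at the two logs side by side, the runs differ in learning_rate (0.01 vs 1.0), max_depth (4 vs 5), max_features (None vs 'sqrt') and n_estimators (2000 vs 500), and in the "bad" run the training loss stops decreasing somewhere between iterations 90 and 100 and then blows up. As a minimal sketch for pinpointing where that happens, the helper below flags the first loss that is non-finite or far above the best loss seen so far. The function name first_divergence and the threshold factor=2.0 are my own illustrative choices, not part of sklearn.

```python
import math

def first_divergence(losses, factor=2.0):
    """Return the position of the first loss that is non-finite or more than
    `factor` times the lowest loss seen so far; return None if there is none.

    `factor` is an arbitrary illustrative threshold, not an sklearn setting.
    """
    best = math.inf
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
        if loss > factor * best:
            return i
        best = min(best, loss)
    return None

# Training losses copied from the "bad" run's verbose output, keyed by
# iteration (the astronomically large value at iteration 500 is left out).
bad_run = {
    90: 0.3155,
    100: 0.3436,
    200: 0.3516,
    300: 1605.5140,
    400: 52215150662014.0469,
}
iters = list(bad_run)
idx = first_divergence(list(bad_run.values()))
print(iters[idx] if idx is not None else "no divergence detected")  # prints 300
```

The verbose log only prints every 100th iteration after the first 100, so this only narrows the blow-up to the 200 to 300 range. On a fitted model, the per-iteration training loss should be available in the train_score_ attribute, so first_divergence(clf.train_score_) would give the exact iteration.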
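Can you find out which data samples are causing this problem? –
I am training on "bite-sized" samples of a dataset with about 1 million rows. Each sample is about 100k rows. Since I ran sklearn.ensemble.ExtraTreesClassifier on the same sample files without errors, the problem does not appear to be related to the input data. – denson
OK. I am asking so that we can have a reproducible example to file a bug against sklearn. –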
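One way to get toward such a reproducible example is to make the 10% sampling deterministic, so that a run that blows up can be re-created from its seed alone. The sketch below is only an assumption about how the sampling could be done. The helper sample_indices and the seeds are hypothetical, and it uses Python's standard random module rather than whatever the original script used.

```python
import random

def sample_indices(n_rows, fraction, seed):
    """Draw a reproducible random subset of row indices, without replacement."""
    rng = random.Random(seed)          # private generator, so the seed fully determines the sample
    k = round(n_rows * fraction)       # number of rows to keep
    return sorted(rng.sample(range(n_rows), k))

n_rows = 1_000_000                     # approximate dataset size mentioned in the thread
for seed in range(3):                  # hypothetical seeds; record the one that misbehaves
    idx = sample_indices(n_rows, 0.10, seed)
    print(seed, len(idx))              # each sample has 100000 rows
```

Passing the same seed as random_state to GradientBoostingClassifier (it is None in both runs above) should also fix the feature subsampling that max_features='sqrt' introduces, so the failing run can be repeated exactly.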