Poor performance of RandomForestClassifier

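I wrote the following Python code to run a RandomForestClassifier (with default parameter settings) on the Forest CoverType dataset from the UCI ML repository. However, the results are very poor, with an accuracy of around 60%, whereas this technique should be able to reach over 90% (e.g. with Weka). I have already tried increasing n_estimators to 100, but that did not bring much improvement.

Any ideas on what I can do to get better results with this technique in scikit-learn, or on what might be causing this poor performance?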

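from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
# cross_val_score lives in sklearn.model_selection; older releases exposed it
# as sklearn.cross_validation.
from sklearn.model_selection import cross_val_score

covtype = fetch_covtype()
clf = RandomForestClassifier()
# cv=3 matches the three scores printed below
scores = cross_val_score(clf, covtype.data, covtype.target, cv=3)
print(scores)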

[ 0.5483831 0.58210057 0.61055001]

2016-07-05 Bart Goethals

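You can try the following to improve your score:

- Train the model on all the attributes available to you. It will overfit, but it will give you an idea of how much accuracy you can reach on the training set.
- Next, drop the least important features using clf.feature_importances_.
- Tune the model's hyperparameters using grid search CV. Use cross-validation and oob_score (the out-of-bag score) to get a better estimate of generalization.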
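The steps above could be sketched roughly as follows. This is only an illustration: it uses a small synthetic dataset from make_classification as a stand-in for covtype.data and covtype.target so that it runs quickly, and the "keep the top half of the features" cut-off is an arbitrary assumption, not something the answer prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for covtype.data / covtype.target (assumption, for speed).
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# Step 1: fit on all features. oob_score=True scores each sample using only
# the trees whose bootstrap sample did not contain it.
full = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
full.fit(X, y)

# Step 2: rank features by importance and keep the top half (arbitrary cut-off).
order = np.argsort(full.feature_importances_)[::-1]
keep = np.sort(order[: X.shape[1] // 2])

# Step 3: refit on the reduced feature set and compare the OOB estimates.
reduced = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
reduced.fit(X[:, keep], y)

print("OOB accuracy, all features: ", round(full.oob_score_, 3))
print("OOB accuracy, kept features:", round(reduced.oob_score_, 3))
```

If the reduced model's OOB accuracy is about the same as the full model's, the dropped features were probably contributing little.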

2016-07-05 09:40:43

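Did you get 90% with the same dataset and the same estimator? The dataset's documentation splits the data as follows:

first 11,340 records: training data subset

next 3,780 records: validation data subset

last 565,892 records: testing data subset

and it reports the following performance figures, which make your untuned random forest look not so bad:

70% neural network (backpropagation)

58% linear discriminant analysis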
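For comparison with those figures, a positional split like the one described could be reproduced along these lines. The helper name split_by_position is hypothetical, and the sketch assumes the rows are still in the original file order (which is why it is demonstrated here on tiny stand-in arrays rather than on the real data).

```python
import numpy as np

def split_by_position(X, y, n_train=11340, n_valid=3780):
    """Slice rows in order: first n_train, next n_valid, remainder as test."""
    train_end = n_train
    valid_end = n_train + n_valid
    return ((X[:train_end], y[:train_end]),
            (X[train_end:valid_end], y[train_end:valid_end]),
            (X[valid_end:], y[valid_end:]))

# Tiny stand-in arrays; with the real data you would pass covtype.data and
# covtype.target and keep the default sizes.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)
train, valid, test = split_by_position(X_demo, y_demo, n_train=10, n_valid=4)
print(len(train[0]), len(valid[0]), len(test[0]))
```

Note how small the training subset is in that protocol: 11,340 of 581,012 records, which makes it a much harder setting than a random 67/33 split.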

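As for n_estimators equal to 100, you can increase it to 500, 1,000 or even more. Check the result for each value and keep the number at which the score starts to stabilize.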
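One way to check where the score levels off without refitting from scratch each time is to grow a single forest incrementally with warm_start and watch the out-of-bag score. The sketch below uses synthetic data and deliberately small tree counts so that it runs quickly; both are assumptions for illustration, and on the real dataset you would use larger values such as those mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption, for speed).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
oob_by_size = {}
for n_trees in (50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)  # warm_start=True adds trees to the existing forest
    oob_by_size[n_trees] = forest.oob_score_

for n_trees, score in oob_by_size.items():
    print(n_trees, round(score, 3))
```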

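The problem may come from Weka's default hyperparameters being different from scikit-learn's. You can tune some of them to improve your results:

max_features: the number of features considered for the split at each tree node.
max_depth: the model may be overfitting your training data a little by growing too deep.
min_samples_split, min_samples_leaf, min_weight_fraction_leaf and max_leaf_nodes: these control how samples are distributed among the leaves, i.e. when a node is allowed to split further and when it is not.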
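As a rough sketch of how these hyperparameters could be explored together, the example below runs a small randomized search. The candidate values are arbitrary assumptions chosen only to illustrate the API, and the data is again a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption, for speed).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=0)

# Candidate values are illustrative, not recommendations.
param_distributions = {
    "max_features": ["sqrt", "log2", 0.5],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "max_leaf_nodes": [None, 50],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions,
    n_iter=6,          # try 6 random combinations instead of the full grid
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

A randomized search is used here only because the full grid grows quickly with five parameters; GridSearchCV accepts the same dictionary if an exhaustive search is affordable.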

You can also try working on your features, either by combining them or by reducing the dimensionality.
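The answer does not say which technique to use for reducing the dimensionality. As one possible illustration, the sketch below puts PCA in front of the forest in a pipeline; the choice of PCA, the number of components and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, for speed).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

# Scale, project onto 8 principal components, then fit the forest.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X, y)

# Apply only the preprocessing steps to inspect the reduced representation.
X_reduced = model[:-1].transform(X)
print(X.shape, "->", X_reduced.shape)
```

Tree ensembles do not always benefit from this kind of projection, so it is worth comparing cross-validated scores with and without the PCA step.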

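You should also have a look at Kaggle scripts such as here, which describe how to get 78% with ExtraTreesClassifier (however, the training set there contains the 11,340 + 3,780 records; they also seem to use a higher n_estimators).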

2016-07-05 10:05:44

I managed to improve on your model by using GridSearchCV:

from sklearn import metrics
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

covtype = fetch_covtype()
clf = RandomForestClassifier()

# Hold out a third of the data; train_test_split shuffles the rows by default.
X_train, X_test, y_train, y_test = train_test_split(covtype.data,
                                                    covtype.target,
                                                    test_size=0.33,
                                                    random_state=42)
params = {'n_estimators': [30, 50, 100],
          'max_features': ['sqrt', 'log2', 10]}
# The target has seven classes, so an averaged F1 scorer is needed;
# plain 'f1' is only defined for binary targets.
gsv = GridSearchCV(clf, params, cv=3,
                   n_jobs=-1, scoring='f1_weighted')
gsv.fit(X_train, y_train)

# Report on the training split, then on the held-out split.
print(metrics.classification_report(y_train, gsv.best_estimator_.predict(X_train)))

print(metrics.classification_report(y_test, gsv.best_estimator_.predict(X_test)))

The output shows a good improvement over your model:

             precision    recall  f1-score   support

          1       1.00      1.00      1.00    141862
          2       1.00      1.00      1.00    189778
          3       1.00      1.00      1.00     24058
          4       1.00      1.00      1.00      1872
          5       1.00      1.00      1.00      6268
          6       1.00      1.00      1.00     11605
          7       1.00      1.00      1.00     13835

avg / total       1.00      1.00      1.00    389278

             precision    recall  f1-score   support

          1       0.97      0.95      0.96     69978
          2       0.95      0.97      0.96     93523
          3       0.95      0.96      0.95     11696
          4       0.92      0.86      0.89       875
          5       0.94      0.78      0.86      3225
          6       0.94      0.90      0.92      5762
          7       0.97      0.95      0.96      6675

avg / total       0.96      0.96      0.96    191734

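This is not too far from the scores on the Kaggle leaderboard (note, though, that the Kaggle competition uses a more challenging data split!).

If you want to see more improvement, you will have to think about the imbalanced classes and about how best to choose your training data.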
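To get a feel for how uneven the classes are, the per-class support counts from the test-set report above can be turned into "balanced" class weights. This is only one possible way to approach the imbalance, not something the answer prescribes; the sketch rebuilds a label vector from those counts purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.arange(1, 8)
# Support column of the test-set classification report.
support = np.array([69978, 93523, 11696, 875, 3225, 5762, 6675])

# Rebuild a label vector with the same class frequencies (illustration only).
y_like = np.repeat(classes, support)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=classes, y=y_like)
for label, count, weight in zip(classes, support, weights):
    print(label, count, round(float(weight), 2))
```

Passing class_weight="balanced" to RandomForestClassifier applies the same weighting scheme during training, so the rare classes (4 and 5 here) count for more in each split.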

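Notes

I used a smaller number of estimators than I normally would in order to save time; however, the model already performed well on the training set, so you may not need to worry about this.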

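I used small values for max_features because this usually reduces the variance of the model (the individual trees become less correlated), although this is not always the case.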

I used the f1 score because I don't know the dataset very well, and f1 tends to work well on classification problems.
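With more than two classes, an F1 score has to be averaged over the classes in some way, and the choice matters when the classes are imbalanced. The toy labels below are made up to show the difference between the averaging modes of f1_score; the corresponding scorer names for cross_val_score and GridSearchCV are 'f1_macro', 'f1_weighted' and 'f1_micro'.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class 1 is common, classes 2 and 3 are rare.
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 2, 2, 3, 3]

macro = f1_score(y_true, y_pred, average="macro")        # plain mean over classes
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by support
micro = f1_score(y_true, y_pred, average="micro")        # pooled over all samples

print("macro:   ", round(macro, 3))
print("weighted:", round(weighted, 3))
print("micro:   ", round(micro, 3))
print("accuracy:", round(accuracy_score(y_true, y_pred), 3))
```

The macro average treats every class equally, so it is the most sensitive to poor results on rare classes, while the weighted average is dominated by the large classes.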

2016-07-06 13:46:21 ncfirth

I tried your code and also printed the best parameters (best_params_), which were n_estimators = 100 and max_features = 10. I then adapted my code to use these parameters and also added the parameter scoring='f1_weighted'. Unfortunately, I still get the same poor results. Any ideas?
clf = RandomForestClassifier(n_estimators=100, max_features=10)
scores = cross_validation.cross_val_score(clf, covtype.data, covtype.target, scoring='f1_weighted')
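One difference between the two code paths is that train_test_split shuffles the rows by default, whereas cross_val_score with a plain integer cv builds its folds in row order without shuffling. Whether that accounts for the gap reported in the comment is an assumption, not something the thread confirms, but it is easy to test by passing an explicitly shuffled splitter. The sketch below shows the call on synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for covtype.data / covtype.target (assumption, for speed).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_features=10, random_state=0)

# Shuffle rows before building the folds, keeping class proportions per fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_weighted")
print(scores)
```

If the scores on the real dataset jump once the folds are shuffled, the row order of the file is a likely culprit; if they stay around 60%, the cause lies elsewhere.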
