1

I am fitting a gbm model with the caret package. When I call trainedGBM$finalModel$fit, the output looks correct. So why don't the R gbm model predictions match the model fit?

But when I call predict(trainedGBM$finalModel, origData, type="response"), I get very different results, and predict(trainedGBM$finalModel, type="response") produces yet another set of results, even though origData has been attached. As I understand it, these calls should all produce the same output. Can someone help me find the problem?

library(caret) 
library(gbm) 

attach(origData) 
gbmGrid <- expand.grid(.n.trees = c(2000), 
         .interaction.depth = c(14:20), 
         .shrinkage = c(0.005)) 
trainedGBM <- train(y ~ ., method = "gbm", distribution = "gaussian", 
        data = origData, tuneGrid = gbmGrid, 
        trControl = trainControl(method = "repeatedcv", number = 10, 
              repeats = 3, verboseIter = FALSE, 
              returnResamp = "all")) 
ntrees <- gbm.perf(trainedGBM$finalModel, method="OOB") 
data.frame(y, 
      finalModelFit = trainedGBM$finalModel$fit, 
      predictDataSpec = predict(trainedGBM$finalModel, origData, type="response", n.trees=ntrees), 
      predictNoDataSpec = predict(trainedGBM$finalModel, type="response", n.trees=ntrees)) 

The code above produces the following partial results:

   y  finalModelFit  predictDataSpec  predictNoDataSpec 
9000      6138.8920         2387.182           2645.993 
5000      3850.8817         2767.990           2467.157 
3000      3533.1183         2753.551           2044.578 
2500      1362.9802         2672.484           1972.361 
1500      5080.2112         2449.185           2000.568 
 750      2284.8188         2728.829           2063.829 
1500      2672.0146         2359.566           2344.451 
5000      3340.5828         2435.137           2093.939 
   0      1303.9898         2377.770           2041.871 
 500       879.9798         2691.886           2034.307 
3000      2928.4573         2327.627           1908.876 
+0

Is my guess correct that this is in the 'caret' package? Making people guess at this sort of question is highly inappropriate when all you needed to do was put 'library(_whatever_package_train_came_from)' in the question. Also, as a separate whinge: use of 'attach' is a common source of hard-to-understand errors. And of course, you should describe 'origData' more fully. –

+0

What sort of description of the data would be helpful here? I have roughly 7000 records, and y is modeled as a function of 26 features, a mix of factors and numerics. – Nostradamus

Answer

7

Based on your gbmGrid, only the interaction depth varies, between 14 and 20; the shrinkage and the number of trees are fixed at 0.005 and 2000 respectively. As set up, trainedGBM therefore only searches for the best interaction depth. The ntrees computed by gbm.perf then answers a different question: given the best interaction depth found between 14 and 20, what is the optimal number of trees under the OOB criterion? Since predictions depend on the number of trees in the model, predictions based on trainedGBM use n.trees = 2000, while predictions based on gbm.perf use the optimal number estimated by that function. This explains the difference between your trainedGBM$finalModel$fit and predict(trainedGBM$finalModel, type="response", n.trees=ntrees).
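
As a quick check (a minimal sketch, assuming the trainedGBM object from the question is available): finalModel$fit stores the training-set predictions from the full 2000-tree ensemble, so forcing predict() to use all 2000 trees should reproduce it.

# Sketch, assuming trainedGBM from the question exists. gbm keeps its 
# training data by default, so predict() can be called without newdata, 
# as in the question; with n.trees = 2000 it should match $fit: 
all.equal(trainedGBM$finalModel$fit, 
     predict(trainedGBM$finalModel, type = "response", n.trees = 2000)) 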

To illustrate, here is an example setup based on the iris data, using gbm as a classification rather than a regression model:

library(caret) 
library(gbm) 

set.seed(42) 

gbmGrid <- expand.grid(.n.trees = 100, 
        .interaction.depth = 1:4, 
        .shrinkage = 0.05) 


trainedGBM <- train(Species ~ ., method = "gbm", distribution='multinomial', 
       data = iris, tuneGrid = gbmGrid, 
       trControl = trainControl(method = "repeatedcv", number = 10, 
             repeats = 3, verboseIter = FALSE, 
             returnResamp = "all")) 
print(trainedGBM)   

This gives:

# Resampling results across tuning parameters: 

# interaction.depth  Accuracy  Kappa  Accuracy SD  Kappa SD 
# 1                  0.947     0.92   0.0407       0.061 
# 2                  0.947     0.92   0.0407       0.061 
# 3                  0.944     0.917  0.0432       0.0648 
# 4                  0.944     0.917  0.0395       0.0592 

# Tuning parameter 'n.trees' was held constant at a value of 100 
# Tuning parameter 'shrinkage' was held constant at a value of 0.05 
# Accuracy was used to select the optimal model using the largest value. 
# The final values used for the model were interaction.depth = 1, n.trees = 100 
# and shrinkage = 0.05.  

Finding the optimal number of trees, conditional on the optimal interaction depth:

ntrees <- gbm.perf(trainedGBM$finalModel, method="OOB") 
# Giving ntrees = 50 
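
Because a gbm prediction is a sum over the first n.trees trees, truncating at the OOB-optimal 50 trees gives different predictions than using all 100. A small sketch against the trainedGBM fit above:

# Sketch: predictions are sums over the first n.trees trees, so the 
# OOB-optimal 50 trees and the full 100 give different probabilities. 
p50 <- predict(trainedGBM$finalModel, iris, n.trees = ntrees, type = "response") 
p100 <- predict(trainedGBM$finalModel, iris, n.trees = 100, type = "response") 
range(p50 - p100) # nonzero: the truncated and full ensembles disagree 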

If instead we train the model varying both the number of trees and the interaction depth:

gbmGrid2 <- expand.grid(.n.trees = 1:100, 
        .interaction.depth = 1:4, 
        .shrinkage = 0.05) 

trainedGBM2 <- train(Species ~ ., method = "gbm", 
       data = iris, tuneGrid = gbmGrid2, 
       trControl = trainControl(method = "repeatedcv", number = 10, 
             repeats = 3, verboseIter = FALSE, 
             returnResamp = "all")) 

print(trainedGBM2) 

# Tuning parameter 'shrinkage' was held constant at a value of 0.05 
# Accuracy was used to select the optimal model using the largest value. 
# The final values used for the model were interaction.depth = 2, n.trees = 39 
# and shrinkage = 0.05. 

Note that the optimal number of trees found when varying both the number of trees and the interaction depth is quite close to the number computed by gbm.perf.
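
To put the two estimates side by side (a quick sketch using the objects created above):

# Sketch: compare the CV-selected tree count with the OOB estimate. 
trainedGBM2$bestTune # CV-selected settings: n.trees = 39, interaction.depth = 2 
ntrees    # 50: gbm.perf's OOB estimate, conditional on the first fit 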