Random forest confusion matrix in R with caret

I have data with a binary YES/NO Class response. I used the following code to run an RF model. I am having trouble getting the confusion matrix results.

library(readxl)  # read_excel()
library(caret)   # createDataPartition(), train(), trainControl()

dataR <- read_excel("*:/*.xlsx")
Train <- createDataPartition(dataR$Class, p = 0.7, list = FALSE)
training <- dataR[Train, ]
testing <- dataR[-Train, ]

model_rf <- train(Class ~ ., tuneLength = 3, data = training, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv", number = 5))

Result:

Random Forest 

3006 samples 
82 predictor 
2 classes: 'NO', 'YES' 

No pre-processing 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 2405, 2406, 2405, 2404, 2404 
Addtional sampling using SMOTE 

Resampling results across tuning parameters: 

  mtry  Accuracy   Kappa
   2    0.7870921  0.2750655
  44    0.7787721  0.2419762
  87    0.7767760  0.2524898

Accuracy was used to select the optimal model using the largest value. 
The final value used for the model was mtry = 2.

So far so good, but when I run this code:

# Apply threshold of 0.50: p_class 
class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") 

# Create confusion matrix 
p <-confusionMatrix(class_log, testing[["Class"]]) 

##gives the accuracy 
p$overall[1]

I get this error:

Error in model_rf[, 1] : incorrect number of dimensions

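I would appreciate it if you guys could help me get the confusion matrix results.

Source

2017-10-18 Mike

Print 'model_rf[, 1]' to the console and take a look at it. – jsb

It would be easier to help you if you included a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question. – jsb

As I understand it, you want to obtain the confusion matrix for the cross-validation in caret.

To do that, you need to specify savePredictions in trainControl. If it is set to "final", the predictions for the best model are saved. By specifying classProbs = TRUE, the probabilities for each class are saved as well.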

data(iris) 
iris_2 <- iris[iris$Species != "setosa",] #make a two class problem 
iris_2$Species <- factor(iris_2$Species) #drop levels 

library(caret) 
model_rf <- train(Species ~ ., tuneLength = 3, data = iris_2, method = "rf",
                  importance = TRUE,
                  trControl = trainControl(method = "cv",
                                           number = 5,
                                           savePredictions = "final",
                                           classProbs = TRUE))

The predictions are in:

model_rf$pred

They are sorted by CV fold; to order them as in the original data frame:

model_rf$pred[order(model_rf$pred$rowIndex), "pred"]

To get the confusion matrix:

confusionMatrix(model_rf$pred[order(model_rf$pred$rowIndex), "pred"], iris_2$Species)
#output 
Confusion Matrix and Statistics

            Reference
Prediction   versicolor virginica
  versicolor         46         6
  virginica           4        44

               Accuracy : 0.9
                 95% CI : (0.8238, 0.951)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.8
 Mcnemar's Test P-Value : 0.7518

            Sensitivity : 0.9200
            Specificity : 0.8800
         Pos Pred Value : 0.8846
         Neg Pred Value : 0.9167
             Prevalence : 0.5000
         Detection Rate : 0.4600
   Detection Prevalence : 0.5200
      Balanced Accuracy : 0.9000

       'Positive' Class : versicolor

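In a two-class setting, using 0.5 as the threshold probability is often suboptimal. The optimal threshold can be found after training by optimizing Kappa or Youden's J statistic (or any other preferred metric) as a function of the probability. Here is an example:

# Cross-validated probability of "versicolor", in original row order
versicolor <- model_rf$pred[order(model_rf$pred$rowIndex), "versicolor"]

res <- do.call(rbind, lapply(1:40/40, function(x) {
    # confusionMatrix() needs factors that share the same levels
    class <- factor(ifelse(versicolor >= x, "versicolor", "virginica"),
                    levels = levels(iris_2$Species))
    mat <- confusionMatrix(class, iris_2$Species)
    data.frame(prob = x, kappa = unname(mat$overall["Kappa"]))
}))
res

Here the highest Kappa is obtained not at threshold == 0.5 but at 0.1. This should be used with care because it can lead to overfitting.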
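The same idea can be sketched for Youden's J (sensitivity + specificity - 1) in base R, without caret. The labels, the probabilities, and the helper name `youden_j` below are made up for illustration and are not from the original data:

```r
# Toy data (assumption): true labels and the predicted probability of "YES".
obs  <- factor(c("YES", "YES", "YES", "NO", "YES", "NO", "NO", "NO"),
               levels = c("NO", "YES"))
prob <- c(0.92, 0.81, 0.73, 0.62, 0.44, 0.33, 0.21, 0.12)

# Hypothetical helper: Youden's J at a given probability threshold.
youden_j <- function(threshold, prob, obs, positive = "YES") {
  pred_pos <- prob >= threshold
  is_pos   <- obs == positive
  sensitivity <- sum(pred_pos & is_pos) / sum(is_pos)
  specificity <- sum(!pred_pos & !is_pos) / sum(!is_pos)
  sensitivity + specificity - 1
}

# Evaluate J on a grid of thresholds: 0.05, 0.10, ..., 0.95.
thresholds <- (1:19) / 20
j_values   <- sapply(thresholds, youden_j, prob = prob, obs = obs)
sweep_res  <- data.frame(threshold = thresholds, J = j_values)

# First threshold that attains the highest J on this toy data.
best <- sweep_res[which.max(sweep_res$J), ]
print(best)
```

On real data, a threshold chosen this way should be checked on held-out samples, for the overfitting reason given above.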

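Source

2017-10-18 20:48:33 missuse

Thanks. Just one question: in this code, the confusion matrix built from pred is only useful when the training data is used as the data set. I think for pred I need to use the testing data set. When I tried the code with testing$Class, it gave this error: Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length – Mike

This code produces the confusion matrix for the cross-validation folds. Since cross-validation is performed on the training set, it applies only to the training set. To get the confusion matrix on the test set, you must first predict the classes of the test set samples and then compare them with the true classes via the 'confusionMatrix' function. – missuse

You can try this to produce the confusion matrix and check the accuracy: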

m <- table(class_log, testing[["Class"]]) 
m #confusion table 

#Accuracy 
(sum(diag(m)))/nrow(testing)
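One caveat with the table approach: if a class never appears among the predictions, the table is no longer square and diag() stops lining up with the correct cells. A minimal sketch with toy vectors (stand-ins for testing[["Class"]] and class_log, not the original data) that fixes the levels explicitly:

```r
# Toy vectors (assumption): stand-ins for testing[["Class"]] and class_log.
truth     <- c("YES", "NO", "NO", "YES", "NO", "NO")
class_log <- c("YES", "YES", "YES", "YES", "YES", "YES")  # "NO" never predicted

# Without explicit levels the table has a single row.
naive <- table(class_log, truth)
dim(naive)

# With explicit levels the table is always 2 x 2.
lvls <- c("NO", "YES")
m <- table(Predicted = factor(class_log, levels = lvls),
           Actual    = factor(truth,     levels = lvls))
m

# Accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(m)) / sum(m)
accuracy
```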

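Source

2017-10-18 17:48:39

Thanks, but I get an error when running the class_log part. I edited my question – Mike

The code block class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") is an if-else statement that performs the following test:

In the first column of model_rf, if the number is greater than 0.50, return "YES", else return "NO", and save the results in object class_log.

So the code essentially creates a character vector of class labels, "YES" and "NO", based on a numeric vector.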
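A small base R sketch of that behaviour, using made-up probabilities. The second half uses a plain list as a stand-in (an assumption, not a real caret model) to show why indexing a non-matrix object with [, 1] fails:

```r
# Hypothetical probabilities of the "YES" class (not from the original data).
p_yes <- c(0.91, 0.35, 0.50, 0.62)

# ifelse() is vectorised: it tests every element and returns a character vector.
class_log <- ifelse(p_yes > 0.50, "YES", "NO")
class_log

# A list has no columns, so `[, 1]` raises an error instead of selecting one.
fake_model <- list(method = "rf")
msg <- tryCatch(fake_model[, 1], error = function(e) conditionMessage(e))
msg
```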

Source

2017-10-18 18:01:11 jsb

You need to apply your model to the test set.

prediction.rf <- predict(model_rf, testing, type = "prob")

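Then do class_log <- ifelse(prediction.rf[, "YES"] > 0.50, "YES", "NO"), using the "YES" column because type = "prob" returns one probability column per class.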
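As a rough sketch of this step, the data frame below mimics the shape that predict(..., type = "prob") returns for a two-class model. Its values and the `truth` vector are invented for illustration:

```r
# Stand-in (assumption) for predict(model_rf, testing, type = "prob"):
# a data frame with one probability column per class level.
prediction.rf <- data.frame(NO  = c(0.80, 0.30, 0.55),
                            YES = c(0.20, 0.70, 0.45))
truth <- factor(c("NO", "YES", "YES"), levels = c("NO", "YES"))

# Threshold the "YES" column only, and keep both levels so that the result
# can be compared with the reference factor.
class_log <- factor(ifelse(prediction.rf[["YES"]] > 0.50, "YES", "NO"),
                    levels = levels(truth))

table(Predicted = class_log, Actual = truth)
mean(class_log == truth)   # accuracy on these toy rows
```

With the real objects, class_log and the test set's Class column (converted to a factor with the same levels) could be passed to confusionMatrix() in the same way.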

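Source

2017-10-18 18:30:30

Thanks. Does the class_log code work for a binary Y/N response class? – Mike

'prediction.rf' will contain real values (note the 'type = "prob"'). You can also pass 'type = "raw"' to get the class labels directly, but this does not let you control the threshold. See '?predict.train' –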
