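I am using caret for ML classification on a set of 100 patients. Since the outcome variable is imbalanced (13/87 samples per group), I perform subsampling with SMOTE and ROSE (see: caret - Subsampling for Class Imbalances).
The median resampled ROC of the svmRadial models is 62.5% without subsampling, 76.4% with ROSE and 77.8% with SMOTE. If I instead look at the accuracy of the predictions from the 3-times-repeated 10-fold CV, I get the best result without subsampling (87%), while SMOTE and ROSE perform worse (71% and 39%).
Can someone explain to me why a higher ROC for SMOTE and ROSE translates into a lower prediction accuracy? Also, I would have expected SMOTE and ROSE to change the number of samples as well as the sample distribution, but when I look at my confusion matrices, the total number of samples is always n = 300 (without subsampling as well as with SMOTE and ROSE).
Please don't mind the poor accuracy of the classifier too much (it is only meant as an example to illustrate my problem...).
Thanks for your help,
Philipp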
library(caret)

my_method <- "svmRadial"

# 3 x repeated CV (number defaults to 10 folds); keep class probabilities and final hold-out predictions
ctrl <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = "final")

# Baseline: no subsampling
set.seed(1)
orig_fit <- train(Class ~ ., data = chosen_train,
                  method = my_method,
                  trControl = ctrl, metric = "ROC", preProc = c("center", "scale"), verbose = FALSE)

# ROSE applied inside resampling
ctrl$sampling <- "rose"
set.seed(1)
rose_inside <- train(Class ~ ., data = chosen_train,
                     method = my_method,
                     trControl = ctrl, metric = "ROC", preProc = c("center", "scale"), verbose = FALSE)

# SMOTE applied inside resampling
ctrl$sampling <- "smote"
set.seed(1)
smote_inside <- train(Class ~ ., data = chosen_train,
                      method = my_method,
                      trControl = ctrl, metric = "ROC", preProc = c("center", "scale"), verbose = FALSE)

# Collect the resampling results of all three models
inside_models <- list(original = orig_fit, rose = rose_inside, smote = smote_inside)
set.seed(1)
inside_resampling <- resamples(inside_models)
> summary(inside_resampling, metric = "ROC")
           Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
original 0.4444  0.5556 0.6250 0.6569  0.7431    1    0
rose     0.3889  0.6667 0.7639 0.7757  0.8889    1    0
smote    0.4444  0.6667 0.7778 0.7845  0.8889    1    0
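One possible reason for a gap between ROC and accuracy can be sketched on made-up numbers: the ROC/AUC depends only on how the class probabilities rank the samples, while accuracy depends on where the cutoff (0.5 by default) falls. The scores below are hypothetical and not taken from the models above, and the helpers `auc_rank` and `acc_at` are illustrative names, not caret functions.

```r
# Toy data (hypothetical): 3 "MAIN" and 7 "OTHER" samples with made-up P(MAIN) scores.
obs   <- factor(c(rep("MAIN", 3), rep("OTHER", 7)), levels = c("MAIN", "OTHER"))
score <- c(0.95, 0.90, 0.85,                          # MAIN
           0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.40)  # OTHER

# AUC via the rank-sum (Mann-Whitney) formulation: the probability that a
# randomly chosen MAIN sample scores higher than a randomly chosen OTHER one.
auc_rank <- function(score, obs, positive = "MAIN") {
  r     <- rank(score)
  n_pos <- sum(obs == positive)
  n_neg <- sum(obs != positive)
  (sum(r[obs == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Accuracy after converting scores into hard classes at a fixed cutoff.
acc_at <- function(score, obs, cutoff, positive = "MAIN", negative = "OTHER") {
  pred <- ifelse(score >= cutoff, positive, negative)
  mean(pred == as.character(obs))
}

auc     <- auc_rank(score, obs)       # ranking quality, independent of any cutoff
acc_050 <- acc_at(score, obs, 0.50)   # accuracy at the default 0.5 cutoff
acc_082 <- acc_at(score, obs, 0.82)   # accuracy with the cutoff between the classes

print(c(auc = auc, acc_050 = acc_050, acc_082 = acc_082))
```

In this toy case the ranking is perfect, yet accuracy at 0.5 is poor because most OTHER samples also score above 0.5. Whether that is what happens with the ROSE and SMOTE fits here would have to be checked on the saved class probabilities.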
> confusionMatrix(rose_inside$pred$pred, rose_inside$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction MAIN OTHER
     MAIN    15   158
     OTHER   24   103

               Accuracy : 0.3933
                 95% CI : (0.3377, 0.4511)
    No Information Rate : 0.87
    P-Value [Acc > NIR] : 1

                  Kappa : -0.0897
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.38462
            Specificity : 0.39464
         Pos Pred Value : 0.08671
         Neg Pred Value : 0.81102
             Prevalence : 0.13000
         Detection Rate : 0.05000
   Detection Prevalence : 0.57667
      Balanced Accuracy : 0.38963

       'Positive' Class : MAIN
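As a sanity check, the headline statistics of the ROSE confusion matrix above can be recomputed by hand from its four cell counts. This is only a base-R sketch of the arithmetic; the variable names are illustrative, and caret's `confusionMatrix()` is not needed for it.

```r
# Cell counts copied from the ROSE confusion matrix (rows = prediction,
# columns = reference); matrix() fills column by column.
cm <- matrix(c(15, 24, 158, 103), nrow = 2,
             dimnames = list(Prediction = c("MAIN", "OTHER"),
                             Reference  = c("MAIN", "OTHER")))

n_total     <- sum(cm)                                      # held-out predictions
accuracy    <- sum(diag(cm)) / n_total                      # (15 + 103) / 300
sensitivity <- cm["MAIN", "MAIN"]   / sum(cm[, "MAIN"])     # 15 / 39
specificity <- cm["OTHER", "OTHER"] / sum(cm[, "OTHER"])    # 103 / 261
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / n_total                   # always predict the majority class

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced, nir = nir), 5))
```

All of these numbers are computed from hard class predictions, so they describe a single operating point rather than the whole ROC curve.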
> confusionMatrix(smote_inside$pred$pred, smote_inside$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction MAIN OTHER
     MAIN     6    55
     OTHER   33   206

               Accuracy : 0.7067
                 95% CI : (0.6516, 0.7576)
    No Information Rate : 0.87
    P-Value [Acc > NIR] : 1.00000

                  Kappa : -0.0459
 Mcnemar's Test P-Value : 0.02518

            Sensitivity : 0.15385
            Specificity : 0.78927
         Pos Pred Value : 0.09836
         Neg Pred Value : 0.86192
             Prevalence : 0.13000
         Detection Rate : 0.02000
   Detection Prevalence : 0.20333
      Balanced Accuracy : 0.47156

       'Positive' Class : MAIN
> confusionMatrix(orig_fit$pred$pred, orig_fit$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction MAIN OTHER
     MAIN     0     0
     OTHER   39   261

               Accuracy : 0.87
                 95% CI : (0.8266, 0.9059)
    No Information Rate : 0.87
    P-Value [Acc > NIR] : 0.5426

                  Kappa : 0
 Mcnemar's Test P-Value : 1.166e-09

            Sensitivity : 0.00
            Specificity : 1.00
         Pos Pred Value : NaN
         Neg Pred Value : 0.87
             Prevalence : 0.13
         Detection Rate : 0.00
   Detection Prevalence : 0.00
      Balanced Accuracy : 0.50

       'Positive' Class : MAIN
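On the constant total of n = 300 in the three confusion matrices: as far as I understand the caret documentation, the `sampling` option subsamples only the training portion of each resample, while the held-out rows are left untouched. Every patient is then predicted once per repeat, giving 100 x 3 = 300 saved predictions regardless of SMOTE or ROSE. The loop below is a rough base-R sketch of that bookkeeping, not caret's actual implementation. It uses plain random folds, whereas caret stratifies folds by class, which does not change the counts.

```r
set.seed(1)
n_patients <- 100  # from the post: 13 MAIN / 87 OTHER
n_folds    <- 10   # 10-fold CV, as stated in the post
n_repeats  <- 3    # repeats = 3 in trainControl()

held_out <- 0
for (r in seq_len(n_repeats)) {
  # Assign every patient to exactly one fold.
  fold <- sample(rep(seq_len(n_folds), length.out = n_patients))
  for (k in seq_len(n_folds)) {
    test_idx <- which(fold == k)
    # A SMOTE/ROSE step would act on the remaining training rows only,
    # so the number of held-out rows does not change.
    held_out <- held_out + length(test_idx)
  }
}
print(held_out)

# The model without subsampling predicts OTHER for every held-out row, so its
# accuracy equals the share of OTHER rows, i.e. the no-information rate.
acc_all_other <- 261 / 300
print(acc_all_other)
```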
You probably shouldn't be looking at the class statistics to evaluate the test set. What does the test set ROC curve show? – topepo