2017-01-24 54 views
1

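Find the Y value corresponding to a specific X

As part of evaluating my model (logistic regression), I am trying to find the precision value that corresponds to a cutoff threshold of 0.5. I get numeric(0) instead of a Y value.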

y_hat = predict(mdl, newdata=ds_ts, type="response") 

pred = prediction(y_hat, ds_ts$popularity) 

perfPrc = performance(pred, "prec")   

xPrc = perfPrc@x.values[[1]] 

# Find the precision value corresponding to a cutoff threshold of 0.5 
prc = yPrc[c(0.5000188)] # perfPrc isn't continuous - closest value to 0.5 

prc # output is 'numeric(0)' 

Precision vs Cutoff

+0

What is your yPrc?

+0

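In your 'pred' definition you are passing a single vector as the 'newdata' argument. That's not good. You should give it a data frame, as you do in the 'y_hat' definition. If that doesn't work, you should share how you created the model. The code or the "call" should be enough. – Gregor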

+0

head(yPrc): [1] NaN 1.0000000 0.5000000 0.6666667 0.5000000 0.4000000 – InterruptedException

Answers

1

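Try this, which applies the definition of precision directly (assuming you have the model object mdl and that your response variable popularity has two levels, 1 (positive) and 0). You could instead use a kNN-style non-parametric approach that aggregates the precision values at the neighbouring cutoffs, or fit a curve Precision = f(Cutoff) and read off the precision at the unknown cutoff, but either would only be an approximation, whereas applying the definition gives you the correct result: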

p <- predict(mdl, newdata=ds_ts, type='response') # compute the prob that the output class label is 1 
test_cut_off <- 0.5 # this is the cut off value for which you want to find precision 
preds <- ifelse(p > test_cut_off, 1, 0) # find the class labels predicted with the new cut off 
prec <- sum((preds == 1) & (ds_ts$popularity == 1))/sum(preds == 1) # TP/(TP + FP) 
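
If you need this for more than one cutoff, the same definition can be wrapped in a small function. This is only a sketch: `precision_at_cutoff` is a hypothetical helper name (it is not part of ROCR), the toy `scores` and `labels` are made up, and returning `NA` when nothing is predicted positive is my own choice.

```r
# Hypothetical helper (not part of ROCR): precision at a single cutoff.
# scores: predicted probabilities; labels: 0/1 ground truth (1 = positive).
precision_at_cutoff <- function(scores, labels, cutoff) {
  predicted_pos <- scores > cutoff         # same rule as ifelse(p > cutoff, 1, 0)
  tp <- sum(predicted_pos & labels == 1)   # true positives
  fp <- sum(predicted_pos & labels == 0)   # false positives
  if (tp + fp == 0) return(NA_real_)       # nothing predicted positive: undefined
  tp / (tp + fp)
}

# Toy data, made up for illustration
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c(1,   0,   1,   1,   0,   0)
precision_at_cutoff(scores, labels, 0.5)   # 2 true positives among 3 predicted positives
```

With the objects above, the call would be `precision_at_cutoff(p, ds_ts$popularity, test_cut_off)`.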

[EDITED] Try the following simple experiment with randomly generated data (you can test it with your own data).

library(ROCR) # provides prediction() and performance() 
set.seed(1234) 
ds_ts <- data.frame(x=rnorm(100), popularity=sample(0:1, 100, replace=TRUE)) 
mdl <- glm(popularity~x, ds_ts, family=binomial()) 
y_hat = predict(mdl, newdata=ds_ts, type="response") 
pred = prediction(y_hat, ds_ts$popularity) 
perfPrc = performance(pred, "prec")   
xPrc = perfPrc@x.values[[1]] # cutoffs 
yPrc = perfPrc@y.values[[1]] # precision at each cutoff 
plot(xPrc, yPrc, pch=19) 


test_cut_off <- 0.5 # this is the cut off value for which you want to find precision 

# Try to read off the precision at the cutoff threshold; 0.5 is not among the x values, so you can't get it this way 
prc = yPrc[c(test_cut_off)] # perfPrc isn't continuous 
prc 
# numeric(0) 

# workaround: 1-NN, use the precision at the nearest cutoff to get an approximate precision; the cutoff you used should work 
nearest_cutoff_index <- which.min(abs(xPrc - test_cut_off)) 
approx_prec_at_cutoff <- yPrc[nearest_cutoff_index] 
approx_prec_at_cutoff 
# [1] 0.5294118 
points(test_cut_off, approx_prec_at_cutoff, pch=19, col='red', cex=2) 


The red point shows the approximate precision (which may be exactly equal to the actual precision, if we are lucky).
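
Whether you get "lucky" is not entirely random: precision can only change when the cutoff crosses one of the predicted scores, so the curve is a step function between neighbouring cutoffs. The sketch below builds a small curve by hand in base R and reads the exact value off it. It assumes that a score at or above the cutoff counts as positive (which is how I understand ROCR builds its curves, so check that on your version); `precision_step_lookup` is a hypothetical helper and the toy data are made up.

```r
# Toy scores and labels, made up for illustration
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c(1,   0,   1,   1,   0,   0)

# Hand-built precision curve: one cutoff per distinct score,
# counting score >= cutoff as a positive prediction (assumed convention)
cutoffs <- sort(unique(scores), decreasing = TRUE)
prec_curve <- sapply(cutoffs, function(ct) {
  pos <- scores >= ct
  sum(pos & labels == 1) / sum(pos)
})

# Hypothetical helper: exact precision at any cutoff by step lookup
precision_step_lookup <- function(x, y, cutoff) {
  candidates <- which(is.finite(x) & x >= cutoff)  # cutoffs at or above the target
  if (length(candidates) == 0) return(NA_real_)    # above every score: undefined
  y[candidates[which.min(x[candidates])]]          # take the smallest such cutoff
}

lookup <- precision_step_lookup(cutoffs, prec_curve, 0.5)
direct <- sum(scores >= 0.5 & labels == 1) / sum(scores >= 0.5)
```

Here `lookup` and `direct` agree. With the objects above you would call `precision_step_lookup(xPrc, yPrc, test_cut_off)`. If you classify with `p > cutoff` instead, use a strict `x > cutoff` in the helper.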

# use average precision from k-NN 
k <- 3 # 3-NN 
nearest_cutoff_indices <- sort(abs(xPrc - test_cut_off), index.return=TRUE)$ix[1:k] 
approx_prec_at_cutoff <- mean(yPrc[nearest_cutoff_indices]) 
approx_prec_at_cutoff 
# [1] 0.5294881 
points(test_cut_off, approx_prec_at_cutoff, pch=19, col='red', cex=2) 


p <- predict(mdl, newdata=ds_ts, type='response') 
preds <- ifelse(p > 0.5000188, 1, 0) 
actual_prec_at_cutoff <- sum((preds == 1) & (ds_ts$popularity == 1))/sum(preds == 1) # TP/(TP + FP) 
actual_prec_at_cutoff 
# [1] 0.5294118 
+0

Thanks. I would rather not compute it directly, and I'm still not sure what is wrong with what I posted. – InterruptedException

+0

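Nothing is wrong as such. It's just that if you want the precision at a specified cutoff that is not among the x values, you need to write your own code to approximate it, e.g. take the precision at the nearest cutoff, or perhaps the average over the k nearest neighbours.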
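
As for where the numeric(0) itself comes from, it may help to see how R treats the subscript: a number inside `[ ]` is a position, not a value to look up, and a fractional position is truncated towards zero, so 0.5000188 becomes position 0 and selects nothing. A minimal sketch, where `y` holds the head(yPrc) values quoted above and the cutoffs in `x` are made up for illustration:

```r
y <- c(NaN, 1, 0.5, 0.6666667, 0.5, 0.4)     # head(yPrc) as quoted in the comments
x <- c(Inf, 0.9, 0.8, 0.7, 0.5000188, 0.4)   # hypothetical cutoffs, made up

# Positional indexing: 0.5000188 is truncated to 0, which selects nothing
by_position <- y[0.5000188]

# Value lookup: find where x matches the cutoff (with a tolerance), then index y
by_value <- y[which(abs(x - 0.5000188) < 1e-6)]
```

`by_position` is an empty numeric vector, while `by_value` picks the element of `y` whose cutoff matches.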

+0

If you can share your data (or a sample), we can check it out.
