2017-03-09 42 views

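I am running a logistic regression with the R package logistf. The package has no function for predicting on new data, and predict does not work with it, so I found some code that shows how to do this for new data: Predict Logistf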

fit<-logistf(Tax ~ L20+L24+L28+L29+L31+L32+L33+L36+S10+S15+S16+S17+S20, data=trainData) 
betas <- coef(fit) 
X <- model.matrix(fit, data=testData) 
probs <- 1/(1 + exp(-X %*% betas)) 
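
As a sanity check on the formula above, here is a minimal sketch showing that the manual inverse-logit computation reproduces what predict(type = "response") returns. It uses base R's glm on simulated data as a stand-in, since glm has a predict method to compare against; the data, the variable names (dat, x1, x2, y) and the 150/50 split are assumptions made for illustration and are not from the question.

```r
# Simulated binary data (illustrative only)
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x1 - 0.8 * dat$x2))

train <- dat[1:150, ]
test  <- dat[151:200, ]

# Ordinary (non-penalised) logistic regression as a stand-in for logistf
fit <- glm(y ~ x1 + x2, family = binomial, data = train)

# Manual prediction: design matrix for the new data times the coefficients,
# passed through the inverse logit
betas <- coef(fit)
X <- model.matrix(delete.response(terms(fit)), data = test)
probs_manual <- as.vector(1 / (1 + exp(-X %*% betas)))

# Built-in prediction for comparison
probs_predict <- as.vector(predict(fit, newdata = test, type = "response"))

all.equal(probs_manual, probs_predict)
```

The same matrix-times-coefficients step is what the logistf snippet relies on; only the way the coefficients are estimated differs.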

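I would like to build a cross-validated version of this, using the fit to produce the predicted probabilities (probs) for me. Has anyone done something like this before?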

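The other thing I would like to know concerns fit$predict. I am fitting a binary logistic regression, and this returns a vector of values. Are these the probabilities for class 0 or for class 1, and how can I tell? Thanks.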
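
One way to check which class the fitted values refer to is to compare their average within each observed class: the class with the higher average is the one being modelled. The sketch below does this with base R's glm, where a binomial fit models the probability of the second factor level (or of 1 for a 0/1 response). I would expect fit$predict from logistf to follow the same convention, but that is an assumption worth verifying with the same check on your own fit. The simulated data and the level names "no"/"yes" are illustrative.

```r
# Simulated data with a factor response; "no" is the first level, "yes" the second
set.seed(1)
x <- rnorm(100)
y <- factor(ifelse(x + rnorm(100) > 0, "yes", "no"), levels = c("no", "yes"))

fit <- glm(y ~ x, family = binomial)

# Fitted probabilities, one per observation
p <- fitted(fit)

# Average fitted probability within each observed class:
# the class with the larger mean is the one the probabilities refer to
class_means <- tapply(p, y, mean)
class_means
```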

Answer


The code you wrote works fine, but there seems to be a more concise way to get the same result:

brglm_model <- brglm(formula = response ~ predictor , family = "binomial", data = train) 
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response") 

As for cross-validation, I guess you have to write a few lines of code yourself:

# Packages used below: brglm for the model, dplyr for %>% and filter(), ggplot2 for the plot
library(brglm)
library(dplyr)
library(ggplot2)

# Set the number of folds and the number of instances in each fold
n_folds <- 5
fold_size <- nrow(dataset) %/% n_folds
residual <- nrow(dataset) %% n_folds

# Label the instances by fold; leftover rows go into the last fold
cv_labels <- c(rep(seq_len(n_folds), each = fold_size), rep(n_folds, residual))

# the error term would differ based on each threshold value 
t_seq <- seq(0.1,0.9,by = 0.1) 
index_mat <- matrix(ncol = (n_folds+1) , nrow = length(t_seq)) 
index_mat[,1] <- t_seq 

# the main loop for calculation of the CV error on each fold 
for (i in seq_len(n_folds)){ 
     train <- dataset %>% filter(cv_labels != i) 
     test <- dataset %>% filter(cv_labels == i) 

     brglm_cv_model <- brglm(formula = response_var ~ . , family = "binomial", data = train) 
     # predict with the model fitted on this fold's training data
     brglm_cv_pred <- predict(object = brglm_cv_model, newdata = test , type = "response") 

     # error formula that you want, e.g. misclassification 
     counter <- 0 

     for (treshold in t_seq) { 
       counter <- counter + 1 
       conf_mat <- table(factor(test$response_var) , factor(brglm_cv_pred>treshold, levels = c("FALSE","TRUE"))) 

       sen <- conf_mat[2,2]/sum(conf_mat[2,]) 

       # other indices can be computed as follows 
       #spec <- conf_mat[1,1]/sum(conf_mat[1,]) 
       #prec <- conf_mat[2,2]/sum(conf_mat[,2]) 
       #F1 <- (2*prec * sen)/(prec+sen) 
       #accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat) 

       #here I am only interested in sensitivity 
       index_mat[counter,(i+1)] <- sen 

     } 

} 

# final data.frame would be the mean of sensitivity over each threshold value 
final_mat <- matrix(nrow = length(t_seq), ncol = 2) 
final_mat[,1] <- t_seq 
final_mat[,2] <- apply(X = index_mat[,-1] , MARGIN = 1 , FUN = mean) 
final_mat <- data.frame(final_mat) 
colnames(final_mat) <- c("treshold","sensitivity") 

# plot the cross-validated sensitivity of the model across threshold values
ggplot(data = final_mat) + 
     geom_line(aes(x = treshold, y = sensitivity), color = "blue") 
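
The loop above assigns folds in row order, which can bias the results if the data set is sorted. Below is a minimal base R sketch of the same idea with shuffled fold labels. It returns one out-of-fold probability per row. The helper cv_probs and its fit_fun / predict_fun arguments are hypothetical names introduced here, not part of any package. The example uses glm on simulated data so it runs without extra packages; for logistf, predict_fun could presumably wrap the coef / model.matrix computation from the question instead.

```r
# Generic k-fold cross-validation returning out-of-fold probabilities.
# cv_probs, fit_fun and predict_fun are hypothetical names for illustration.
cv_probs <- function(data, n_folds, fit_fun, predict_fun, seed = 1) {
  set.seed(seed)
  # Shuffle the fold labels so that folds do not depend on row order
  folds <- sample(rep(seq_len(n_folds), length.out = nrow(data)))
  probs <- numeric(nrow(data))
  for (i in seq_len(n_folds)) {
    held_out <- folds == i
    # Fit on the other folds, predict on the held-out fold
    model <- fit_fun(data[!held_out, , drop = FALSE])
    probs[held_out] <- predict_fun(model, data[held_out, , drop = FALSE])
  }
  list(probs = probs, folds = folds)
}

# Simulated data (illustrative only)
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x1 - 0.8 * dat$x2))

res <- cv_probs(
  dat, n_folds = 5,
  fit_fun = function(d) glm(y ~ x1 + x2, family = binomial, data = d),
  predict_fun = function(m, d) {
    as.vector(predict(m, newdata = d, type = "response"))
  }
)

# Cross-validated sensitivity at a 0.5 cutoff
pred_pos <- res$probs > 0.5
sens <- sum(pred_pos & dat$y == 1) / sum(dat$y == 1)
sens
```

Because every row is predicted exactly once by a model that never saw it, res$probs can be thresholded afterwards at any cutoff without refitting.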