
I have been trying to fit a lasso model using cv.glmnet. Based on how the standardization is done, I tried to implement four different models (three with cv.glmnet and one with caret::train). All four models give very different coefficient estimates and I do not understand why. Why do the coefficient estimates from glmnet vary so much between models with the same input parameters?

Here is fully reproducible code:

library("glmnet") 
data(iris) 
iris <- iris 
dat <- iris[iris$Species %in% c("setosa","versicolor"),] 
X <- as.matrix(dat[,1:4]) 
Y <- as.factor(as.character(dat$Species)) 

set.seed(123) 
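## model1: raw X, standardize = FALSE 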
model1 <- cv.glmnet(x = X, 
        y = Y, 
        family = "binomial", 
        standardize = FALSE, 
        alpha = 1, 
        lambda = rev(seq(0,1,length=100)), 
        nfolds = 3) 

set.seed(123) 
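## model2: pre-scaled X (scale(X)), standardize = FALSE 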
model2 <- cv.glmnet(x = scale(X, center = T, scale = T), 
        y = Y, 
        family = "binomial", 
        standardize = FALSE, 
        alpha = 1, 
        lambda = rev(seq(0,1,length=100)), 
        nfolds = 3) 
set.seed(123) 
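## model3: raw X, standardize = TRUE 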
model3 <- cv.glmnet(x = X, 
        y = Y, 
        family = "binomial", 
        standardize = TRUE, 
        alpha = 1, 
        lambda = rev(seq(0,1,length=100)), 
        nfolds = 3) 

##Using caret 
library("caret") 

lambda.grid <- rev(seq(0,1,length=100)) #set of lambda values for cross-validation 
alpha.grid <- 1 #alpha 
trainControl <- trainControl(method ="cv", 
          number=3) #3-fold cross-validation 
tuneGrid <- expand.grid(.alpha=alpha.grid, .lambda=lambda.grid) #these are tuning parameters to be passed into the train function below 

set.seed(123) 
model4 <- train(x = X, 
       y = Y, 
       method="glmnet", 
       family="binomial", 
       standardize = FALSE, 
       trControl = trainControl,       
       tuneGrid = tuneGrid) 

c1 <- coef(model1, s=model1$lambda.min) 
c2 <- coef(model2, s=model2$lambda.min) 
c3 <- coef(model3, s=model3$lambda.min) 
c4 <- coef(model4$finalModel, s=model4$finalModel$lambdaOpt) 
c1 <- as.matrix(c1) 
c2 <- as.matrix(c2) 
c3 <- as.matrix(c3) 
c4 <- as.matrix(c4) 

model2 scales the independent variables (the matrix X) beforehand, and model3 does the same by setting standardize = TRUE. So at least these two models should return the same results - but they do not.

The lambda.min values obtained from the four models are:

model1 = 0 
model2 = 0 
model3 = 0 
model4 = 0.6565657 

The coefficient estimates differ drastically between the models. Why is this happening?
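
For reference, one way to line the four estimates up side by side (using the c1 to c4 matrices built above; the column labels below are only for readability):

est <- cbind(c1, c2, c3, c4)   # 5 x 4 matrix: intercept + 4 slopes per model 
colnames(est) <- c("model1", "model2", "model3", "model4") 
est 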


The standardization in 'glmnet' is done by Fortran code, so it is hard to tell whether it and scale actually do 100% the same thing. – JAD


No matter which programming language is used, scale should standardize the data. That means subtracting the corresponding column mean from each column and dividing by the column standard deviation, so every column has zero mean and unit variance. I don't really understand why things should be this complicated when they shouldn't be :-( – technOslerphile
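
A quick check of what the comment above describes (using the X matrix from the question):

Xs <- scale(X, center = TRUE, scale = TRUE) 
round(colMeans(Xs), 10)   # approximately zero for every column 
apply(Xs, 2, sd)          # 1 for every column 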


For comparing c2 to c3: see the 'standardize' argument in '?glmnet'; when TRUE, ... *"the coefficients are always returned on the original scale"*. Try 'xs = scale(X); sx = attr(xs, "scaled:scale"); ce = attr(xs, "scaled:center"); co = as.numeric(c2); co[-1]/sx; co[1] - sum((co[-1]/sx)*ce)' - this is closer. – user20650
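
A runnable version of the back-transformation sketched in that comment (it assumes the X, c2 and c3 objects from the question's code):

xs <- scale(X, center = TRUE, scale = TRUE) 
sx <- attr(xs, "scaled:scale")    # column standard deviations of X 
ce <- attr(xs, "scaled:center")   # column means of X 
co <- as.numeric(c2)              # intercept + slopes from model2 (fit on scaled X) 

slopes_orig    <- co[-1] / sx                       # slopes back on the original scale 
intercept_orig <- co[1] - sum((co[-1] / sx) * ce)   # intercept back on the original scale 

cbind(model2_backtransformed = c(intercept_orig, slopes_orig), 
      model3                 = as.numeric(c3)) 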

Answer


Actually, there is a small difference between scale(x) with standardize = FALSE and x with standardize = TRUE: we need to multiply by (N-1)/N.

See here.
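
As a small illustration of the two denominators involved (the sketch below assumes, consistent with the (N-1)/N factor above, that glmnet divides by N when it standardizes internally, while scale() relies on R's sd(), which divides by N-1):

x <- rnorm(20) 
N <- length(x) 
sd_sample <- sd(x)                            # denominator N - 1 (what scale() uses) 
sd_pop    <- sqrt(sum((x - mean(x))^2) / N)   # denominator N 
c(ratio = sd_pop / sd_sample, theoretical = sqrt((N - 1) / N))   # the two match 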

If we use the Gaussian family:

library(glmnet) 

X <- matrix(runif(100, 0, 1), ncol = 2) 
y <- 1 - 2 * X[, 1] + X[, 2] 

# Fit on the raw X and let glmnet standardize internally 
enet <- glmnet(X, y, lambda = 0.1, standardize = TRUE, family = "gaussian") 
coef <- coefficients(enet) 
coef[2] * sd(X[, 1]) / sd(y)   # coefficient rescaled to the standardized scale 
#[1] -0.6895065 

# Rescale X, y and lambda by hand and fit with standardize = FALSE 
enet1 <- glmnet(scale(X) / 99 * 100, y / (99 / 100 * sd(y)), 
                lambda = 0.1 / (99 / 100 * sd(y)), 
                standardize = FALSE, family = "gaussian") 
coefficients(enet1)[2] 
#[1] -0.6894995 

If we use the binomial family:

data(iris) 
dat <- iris[iris$Species %in% c("setosa", "versicolor"), ] 
X <- as.matrix(dat[, 1:4]) 
Y <- as.factor(as.character(dat$Species)) 

# Fit on the raw X and let glmnet standardize internally 
set.seed(123) 
model1 <- cv.glmnet(x = X, 
                    y = Y, 
                    family = "binomial", 
                    standardize = TRUE, 
                    alpha = 1, 
                    lambda = rev(seq(0, 1, length = 100)), 
                    nfolds = 3) 
coefficients(model1, s = 0.03)[3] * sd(X[, 2])   # slope of the 2nd predictor, rescaled to standardized units 
#[1] -0.3374946 

# Fit on the manually rescaled X with standardize = FALSE 
set.seed(123) 
model3 <- cv.glmnet(x = scale(X) / 99 * 100, 
                    y = Y, 
                    family = "binomial", 
                    standardize = FALSE, 
                    alpha = 1, 
                    lambda = rev(seq(0, 1, length = 100)), 
                    nfolds = 3) 
coefficients(model3, s = 0.03)[3] 
#[1] -0.3355027 
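
The same comparison can be extended to all four predictors (a small sketch using the model1, model3 and X objects fitted above):

cbind(model1_rescaled = as.numeric(coefficients(model1, s = 0.03))[-1] * apply(X, 2, sd), 
      model3          = as.numeric(coefficients(model3, s = 0.03))[-1]) 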

These results are almost identical. I hope this answer is not too late.
