2017-10-18

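Simulating multiple regression data with a fixed R2: how to incorporate correlated variables?

I would like to simulate data for a multiple linear regression with four predictors, where I am free to specify

  • the overall explained variance of the model (R2)
  • the magnitude of all standardized regression coefficients
  • the degree to which the predictor variables are correlated with each other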

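I arrived at a solution that satisfies the first two points, but it relies on all independent variables being uncorrelated with each other (see the code below). To get standardized regression coefficients, I sample the predictors from a population with mean = 0 and variance = 1.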

# Specify the population variance/covariance matrix the four predictors are sampled from 
sigma.1 <- matrix(c(1,0,0,0, 
        0,1,0,0, 
        0,0,1,0,  
        0,0,0,1),nrow=4,ncol=4) 
# Specify the population means the four predictor variables are sampled from 
mu.1 <- rep(0,4) 

# Specify sample size, true regression coefficients, and explained variance 
n.obs <- 50000 # to avoid sampling error problems 
intercept <- 0.5 
beta <- c(0.4, 0.3, 0.25, 0.25) 
r2 <- 0.30 

# Create sample with four predictor variables 
library(MASS) 
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE)) 

# Add error variable based on desired r2 
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2) 
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon)) 

# Add y variable based on true coefficients and desired r2 
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 + 
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon 

# Inspect model 
summary(lm(y~V1+V2+V3+V4, data=sample1)) 

Call: 
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1) 

Residuals: 
    Min      1Q  Median      3Q     Max 
-4.0564 -0.6310 -0.0048  0.6339  3.7119 

Coefficients: 
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.496063   0.004175  118.82   <2e-16 ***
V1          0.402588   0.004189   96.11   <2e-16 ***
V2          0.291636   0.004178   69.81   <2e-16 ***
V3          0.247347   0.004171   59.30   <2e-16 ***
V4          0.253810   0.004175   60.79   <2e-16 ***
--- 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9335 on 49995 degrees of freedom 
Multiple R-squared: 0.299, Adjusted R-squared: 0.299 
F-statistic: 5332 on 4 and 49995 DF, p-value: < 2.2e-16 
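
The error-variance line above follows from solving R2 = explained / (explained + error) for the error variance. A minimal population-level sketch (no sampling), assuming unit-variance, uncorrelated predictors; the names var.explained and r2.implied are illustrative and not part of the original code:

```r
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30

# With uncorrelated, unit-variance predictors the explained variance is sum(beta^2)
var.explained <- sum(beta^2)

# Solve r2 = var.explained / (var.explained + var.epsilon) for var.epsilon
var.epsilon <- var.explained * (1 - r2) / r2

# Population R2 implied by these two variance components
r2.implied <- var.explained / (var.explained + var.epsilon)
print(c(var.explained = var.explained, var.epsilon = var.epsilon, r2.implied = r2.implied))
```

The square root of var.epsilon should land near the residual standard error reported by lm() in the output above.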

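Question: if my predictor variables are correlated, that is, if their variance/covariance matrix is specified with non-zero off-diagonal elements, the R2 and regression coefficients differ considerably from what I want them to be, for example when using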

sigma.1 <- matrix(c(1,0.25,0.25,0.25, 
        0.25,1,0.25,0.25, 
        0.25,0.25,1,0.25,  
        0.25,0.25,0.25,1),nrow=4,ncol=4) 

Any suggestions? Thanks!

Answer

After thinking about my problem some more, I found the answer.

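The code above first samples the predictor variables with a given degree of intercorrelation. It then adds an error column based on the desired value of r2. Finally, it adds everything together to create a column for y.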

So far, the lines that create the error were simply

var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2) 
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon)) 

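So it assumes that each beta coefficient contributes 100% to the explanation of y (= no intercorrelation among the independent variables). But if the x variables are correlated, each beta does not (!) contribute 100%. This means the variance of the error has to be larger, because the variables share some of their variability with each other.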
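
To see how far the original error variance is off once the predictors correlate, one can compute the population variance of the linear predictor as the quadratic form t(beta) %*% sigma %*% beta, the standard expression for the variance of a linear combination. The sketch below assumes the 0.25 intercorrelation from the question; var.explained, var.epsilon.old and r2.implied are illustrative names.

```r
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30

# Equicorrelated covariance matrix: 1 on the diagonal, 0.25 elsewhere
sigma.1 <- matrix(0.25, nrow = 4, ncol = 4)
diag(sigma.1) <- 1

# Variance of beta[1]*V1 + ... + beta[4]*V4 when the predictors covary
var.explained <- drop(t(beta) %*% sigma.1 %*% beta)

# Error variance as computed in the question (ignores the covariances)
var.epsilon.old <- sum(beta^2) * (1 - r2) / r2

# Population R2 that results from combining the two
r2.implied <- var.explained / (var.explained + var.epsilon.old)
print(c(var.explained = var.explained, r2.implied = r2.implied))
```

Under these assumptions the explained variance grows from 0.375 to about 0.641, so the unchanged error variance leaves the implied R2 well above the 0.30 target.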

How much larger? Just adapt the creation of the error term as follows:

var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2) 

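So the degree to which the independent variables are correlated is simply added to the error variance by including cor(sample1$V1, sample1$V2). With an intercorrelation of 0.25, for example when using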

sigma.1 <- matrix(c(1,0.25,0.25,0.25, 
       0.25,1,0.25,0.25, 
       0.25,0.25,1,0.25,  
       0.25,0.25,0.25,1),nrow=4,ncol=4) 

cor(sample1$V1, sample1$V2) is close to 0.25, and this value is added to the sum that determines the variance of the error term.

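Assuming all intercorrelations are equal, this lets you specify any degree of intercorrelation among the independent variables, together with the true standardized regression coefficients and the desired R2.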
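
Adding a single correlation works well for these particular coefficients, but it appears to be an approximation rather than a general rule. The exact explained variance is t(beta) %*% sigma.1 %*% beta, which for an intercorrelation of 0.25 is 0.375 + 0.25 * 1.065 = 0.64125, against 0.375 + 0.25 = 0.625 from the shortcut. A hedged sketch of the exact version follows; error_variance is a hypothetical helper, not part of the original answer.

```r
# Hypothetical helper: error variance that yields a target population R2
# for predictors with covariance matrix sigma and coefficients beta
error_variance <- function(beta, sigma, r2) {
  var.explained <- drop(t(beta) %*% sigma %*% beta)
  var.explained * (1 - r2) / r2
}

beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30
sigma.1 <- matrix(0.25, nrow = 4, ncol = 4)
diag(sigma.1) <- 1

var.epsilon.exact <- error_variance(beta, sigma.1, r2)
var.epsilon.shortcut <- (sum(beta^2) + 0.25) * (1 - r2) / r2

# Population R2 implied by each choice of error variance
var.explained <- drop(t(beta) %*% sigma.1 %*% beta)
r2.exact <- var.explained / (var.explained + var.epsilon.exact)
r2.shortcut <- var.explained / (var.explained + var.epsilon.shortcut)
print(c(r2.exact = r2.exact, r2.shortcut = r2.shortcut))
```

The shortcut stays close here because the cross-products beta[i] * beta[j] over all pairs with i != j sum to 1.065, which is nearly 1. The helper does not require equal intercorrelations, so it should also cover covariance matrices with differing off-diagonal entries.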

證明:

sigma.1 <- matrix(c(1,0.35,0.35,0.35, 
        0.35,1,0.35,0.35, 
        0.35,0.35,1,0.35,  
        0.35,0.35,0.35,1),nrow=4,ncol=4) 
# Specify the population means the four predictor variables are sampled from 
mu.1 <- rep(0,4) 

# Specify sample size, true regression coefficients, and explained variance 
n.obs <- 500000 # to avoid sampling error problems 
intercept <- 0.5 
beta <- c(0.4, 0.3, 0.25, 0.25) 
r2 <- 0.15 

# Create sample with four predictor variables 
library(MASS) 
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE)) 

# Add error variable based on desired r2 
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2) 
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon)) 

# Add y variable based on true coefficients and desired r2 
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 + 
    beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon 

# Inspect model 
summary(lm(y~V1+V2+V3+V4, data=sample1)) 

Call: 
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1) 

Residuals: 
     Min       1Q   Median       3Q      Max 
-10.7250  -1.3696   0.0017   1.3650   9.0460 

Coefficients: 
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.499554   0.002869  174.14   <2e-16 ***
V1          0.406360   0.003236  125.56   <2e-16 ***
V2          0.298892   0.003233   92.45   <2e-16 ***
V3          0.247581   0.003240   76.42   <2e-16 ***
V4          0.253510   0.003241   78.23   <2e-16 ***
--- 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 2.028 on 499995 degrees of freedom 
Multiple R-squared: 0.1558, Adjusted R-squared: 0.1557 
F-statistic: 2.306e+04 on 4 and 499995 DF, p-value: < 2.2e-16
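
The fitted R2 of 0.1558 sits slightly above the 0.15 target. That is roughly what one would expect if the explained variance is really t(beta) %*% sigma.1 %*% beta (about 0.748 here) rather than sum(beta^2) + 0.35 = 0.725, which implies a population R2 of about 0.154. As a rough check, the same simulation can be rerun with the error variance taken from the quadratic form. The seed and the smaller n.obs are arbitrary choices for this sketch, and var.explained and r2.fit are illustrative names.

```r
library(MASS)
set.seed(1)

# Equicorrelated predictors with intercorrelation 0.35, as in the run above
sigma.1 <- matrix(0.35, nrow = 4, ncol = 4)
diag(sigma.1) <- 1
mu.1 <- rep(0, 4)

n.obs <- 100000
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15

sample1 <- as.data.frame(mvrnorm(n = n.obs, mu = mu.1, Sigma = sigma.1))

# Error variance from the exact population variance of the linear predictor
var.explained <- drop(t(beta) %*% sigma.1 %*% beta)
var.epsilon <- var.explained * (1 - r2) / r2
sample1$epsilon <- rnorm(n.obs, sd = sqrt(var.epsilon))

# Build y from the true coefficients; drop() keeps y a plain numeric vector
x <- as.matrix(sample1[, c("V1", "V2", "V3", "V4")])
sample1$y <- intercept + drop(x %*% beta) + sample1$epsilon

# Fit the model and extract the sample R2
fit <- lm(y ~ V1 + V2 + V3 + V4, data = sample1)
r2.fit <- summary(fit)$r.squared
print(round(c(coef(fit), r.squared = r2.fit), 3))
```

With this error variance the sample R2 should scatter around 0.15 instead of sitting systematically above it, while the coefficient estimates remain close to beta.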