2017-12-03 185 views
4

我一直在使用SAS很長一段時間,現在我想翻譯我的代碼在R.我需要幫助做到以下幾點:蒙特卡羅模擬,引導和迴歸中的R

  1. 生成若干自舉樣品
  2. 運行在每個樣品
  3. 存儲在一個新的數據集的參數的線性迴歸模型的複製樣品

我編輯的代碼,更清晰。 我用了很多for循環,我知道並不總是推薦。這個過程非常緩慢

是否有代碼/軟件包(例如,應用家庭功能,「caret」軟件包)可以使這個非常乾淨的高效/快,尤其是如samplesize * bootsample> 1000萬

任何幫助將不勝感激。

samplesize <- 200 
bootsize<- 500 
myseed <- 123 

#generating a fake dataset 
     id=1:n 
     set.seed(myseed) 
     x <- rnorm(samplesize, 5, 5) 
     y <- rnorm(samplesize, 2 + 0.4*x, 0.5) 
     data <- data.frame(id, x, y) 

head(data) 
    id   x  y 
1 1 2.197622 3.978454 
2 2 3.849113 4.195852 
3 3 12.793542 6.984844 
4 4 5.352542 4.412614 
5 5 5.646439 4.051405 
6 6 13.575325 7.192007 

# generate bootstrap samples 

bootstrap <- function(nbootsamples, data, seed) { 
    bootdata <- data.frame() #to initialize it 
    set.seed(seed) 
    for (i in 1:nbootsamples) { 
    replicate <- i 
    bootstrapIndex <- sample(1:nrow(data), replace = TRUE) 
    datatemp <- data[bootstrapIndex, ] 
    tempall <- cbind(replicate, datatemp) 
    bootdata <- rbind(bootdata, tempall) 
    } 
    return(bootdata) 
} 
bootdata <- bootstrap(nbootsamples=bootsize, data=data, seed=myseed) 
bootdata <- dplyr::arrange(bootdata, replicate, id) 
head(bootdata) 
#The data should look like this 
    replicate id   x  y 
1   1 1 2.197622 3.978454 
2   1 3 12.793542 6.984844 
3   1 5 5.646439 4.051405 
4   1 9 1.565736 3.451748 
5   1 10 2.771690 3.081662 
6   1 10 2.771690 3.081662 

#Model-fitting and saving coefficient and means 

modelFitting <- function(y, x, data) { 
    modeltemp <- glm(y ~ x, 
    data = data, 
    family = gaussian('identity')) 
    Inty <- coef(modeltemp)["(Intercept)"] 
    betaX <- coef(modeltemp)["x"] 
    sdy <- sd(residuals.glm(modeltemp)) 
    data.frame(Inty, betaX, sdy, row.names = NULL) 
} 

saveParameters <- function(nbootsamples, data, seed) { 
    parameters <- data.frame() #to initialize it 
    for (i in 1:length(unique(data$replicate))) { 
    replicate <- i 
    datai <- data[ which(data$replicate==i),] 
    datatemp <- modelFitting(y, x,data=datai) 
    meandata <- data.frame(Pr_X=mean(datai$x)) 
    tempall <- cbind(replicate, datatemp, meandata) 
    parameters <- rbind(parameters, tempall) 
    } 
    return(parameters) 
} 
parameters <- saveParameters(nbootsamples=bootsize, data=bootdata, seed=myseed) 
head(parameters) 

#Ultimately all I want is my final dataset to look like the following 

    replicate  Inty  betaX  sdy  Pr_X 
1   1 2.135529 0.3851757 0.5162728 4.995836 
2   2 1.957152 0.4094682 0.5071635 4.835884 
3   3 2.044257 0.3989742 0.4734178 5.111185 
4   4 2.093452 0.3861861 0.4921470 4.741299 
5   5 2.017825 0.4037699 0.5240363 4.931793 
6   6 2.026952 0.3979731 0.4898346 5.502320 
+1

支持對於連續因變量,不應該家族=高斯(「身份」)? –

+0

謝謝你Len Greski。我會修復它 – SimRock

回答

2

使用caret包輕鬆實現了重採樣迴歸。根據您的示例數據,通過廣義線性模型運行200個bootstrap樣本的代碼如下所示。從插入符號

library(caret) 
x = round(rnorm(200, 5, 5)) 
y= rnorm(200, 2 + 0.4*x, 0.5) 
theData <- data.frame(id=1:200,x, y) 
# configure caret training parameters to 200 bootstrap samples 
fitControl <- trainControl(method = "boot", 
          number = 200) 
fit <- train(y ~ x, method="glm",data=theData, 
      trControl = fitControl) 
# print output object 
fit 
# print first 10 resamples 
fit$resample[1:10,] 

輸出如下:有關如何使用插入符號,包括生成的模型對象的內容

> fit 
Generalized Linear Model 

200 samples 
    1 predictor 

No pre-processing 
Resampling: Bootstrapped (200 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results: 

    RMSE  Rsquared MAE  
    0.4739306 0.9438834 0.3772199 

> fit$resample[1:10,] 
     RMSE Rsquared  MAE Resample 
1 0.5069606 0.9520896 0.3872257 Resample001 
2 0.4636029 0.9460214 0.3711900 Resample002 
3 0.4446103 0.9549866 0.3435148 Resample003 
4 0.4464119 0.9443726 0.3636947 Resample004 
5 0.5193685 0.9191259 0.4010104 Resample005 
6 0.4995917 0.9451417 0.4044659 Resample006 
7 0.4347831 0.9494606 0.3383224 Resample007 
8 0.4725041 0.9483434 0.3716319 Resample008 
9 0.5295650 0.9458453 0.4241543 Resample009 
10 0.4796985 0.9514595 0.3927207 Resample010 
> 

細節(例如訪問各個車型,因此您可以生成與predict()功能預測爲模擬)可在caret GitHub site

卡特也支持並行處理。有關如何使用脫字符並行處理的示例,請參閱Improving Performance of Random Forest with caret::train()

此外,蒙特卡洛仿真中的R通過Monte Carlo包在R.

+0

謝謝@倫格斯基。爲了更清晰地編輯問題和代碼。我嘗試了'caret package',但我無法弄清楚如何得到想要的結果。任何建議,謝謝 – SimRock