2016-02-26

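How to simulate MAR missing data in R?

I would like to simulate some missing data in R, but I'm having trouble. I have created two variables ("pre" and "post") that represent measurements taken before and after treatment on the same individuals (i.e. paired data). I have been able to do this for data that are missing completely at random (MCAR), see below, but I can't figure out how to code it for data that are missing at random (MAR). For the MAR case, I'd like to create 3 categories based on the pre-treatment observations, which determine what share of the post-treatment observations is missing, i.e.:

For pre > 25, 40% of post missing
For pre > 21 and ≤ 25, 30% of post missing
For pre ≤ 21, 20% of post missing

Can anyone help? (I'd be very grateful!)
Thanks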

library(mvtnorm) # provides rmvnorm()

set.seed(80122) 
n <- 1000 

# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7. 
data <- rmvnorm(n,mean=c(28,18),sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix 

# Split into pre and post treatment and check correlation is what was specified 
pre <- data[, 1] 
post <- data[, 2] 
cor.test(pre,post) 

# Simulate MCAR 
mcar <- 1 - rbinom(n, 1, 0.2) # Will create ~ 20% zero's which we'll convert to NA's 
post_mcar <- post 
post_mcar[mcar == 0] <- mcar[mcar==0] # Replace post data with random zero's from mcar vector 
post_mcar[mcar == 0] <- NA # Change zero's to NAs 

Answer

This is an old question, but I thought I'd take a crack at it.

Simulate fake data as in the OP:

library(tidyverse) 
library(mvtnorm) 

# Number of data values 
n <- 1000 

# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7. 
set.seed(80122) 
data <- rmvnorm(n, mean=c(28,18), 
       sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix 

Convert to a data frame:

data = as.data.frame(data) 
names(data) = c("pre", "post") 

Simulate missing completely at random (MCAR) data:

data$post_mcar <- data$post 

set.seed(2) 
data$post_mcar[sample(1:nrow(data), 0.2*nrow(data))] = NA 

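Simulate missing at random (MAR) data: first, we'll create a grouping variable, frac, whose value is the fraction of each group that we want to set to missing. We'll use the cut function to create the groups and set the label values, then convert the labels to numeric values for later use: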

data = data %>% 
    mutate(post_mar = post, 
     frac = as.numeric(as.character(cut(pre, breaks=c(-Inf, 21, 25, Inf), 
              labels=c(0.2,0.3,0.4))))) 
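
As a quick check of how cut treats the boundary values, here is a minimal base-R sketch. The test values in x and the names grp and frac_check are made up for illustration. By default cut uses right-closed intervals, so a pre value of exactly 21 lands in the 20% group and exactly 25 in the 30% group, which matches the rule in the question.

```r
# Hypothetical pre-treatment values chosen to sit on and around the cut points
x <- c(20, 21, 21.1, 25, 25.1)

# Same breaks and labels as above; intervals are (-Inf, 21], (21, 25], (25, Inf)
grp <- cut(x, breaks = c(-Inf, 21, 25, Inf), labels = c(0.2, 0.3, 0.4))

# Go through as.character() first: as.numeric() applied directly to a factor
# returns the internal level codes (1, 2, 3), not the labels
frac_check <- as.numeric(as.character(grp))
print(frac_check)
```

The result should be 0.2, 0.2, 0.3, 0.3, 0.4 for the five test values.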

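Now group by frac and set a randomly selected portion of the values in each group to NA, using frac to determine the fraction of values that become NA.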

set.seed(3) 
data = data %>% 
    group_by(frac) %>% 
    mutate(post_mar=replace(post_mar, row_number(post_mar) %in% sample(1:n(), round(unique(frac)*n())), NA)) %>% 
    ungroup 
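
An alternative, which is not the approach used above, is to give each row its own missingness probability and draw one independent Bernoulli indicator per row, in the same style as the MCAR code in the question. This is only a sketch under stated assumptions: to stay self-contained it builds pre and post from rnorm with the same means, standard deviations and correlation instead of calling rmvnorm, and the names p_miss, is_missing and post_mar_b are made up for illustration. With this approach the missing fraction in each group is 20%, 30% or 40% in expectation rather than by construction.

```r
set.seed(80122)
n <- 1000

# Stand-in for the rmvnorm() call: same means (28, 18), SDs (3, 3) and
# correlation (0.7), built from base R only so the sketch runs without mvtnorm
pre  <- rnorm(n, mean = 28, sd = 3)
post <- 18 + 0.7 * (pre - 28) + rnorm(n, mean = 0, sd = 3 * sqrt(1 - 0.7^2))

# Per-row probability that the post value is missing, determined by pre only
p_miss <- ifelse(pre > 25, 0.4, ifelse(pre > 21, 0.3, 0.2))

# One Bernoulli draw per row; TRUE means the post value is set to NA
is_missing <- rbinom(n, size = 1, prob = p_miss) == 1
post_mar_b <- post
post_mar_b[is_missing] <- NA

# Observed missing fraction per probability group (approximate, not exact)
print(tapply(is.na(post_mar_b), p_miss, mean))
```

The grouped sample approach above fixes the number of missing values per group, while this variant lets it vary from run to run, which may be closer to how missingness arises in practice.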

Here are the last 6 rows of the resulting data frame:

          pre     post post_mcar post_mar frac
995  28.63476 19.35081  19.35081 19.35081  0.4
996  32.86278 24.16119        NA       NA  0.4
997  28.25965 16.64538  16.64538 16.64538  0.4
998  24.35255 17.80365  17.80365 17.80365  0.3
999  28.12426 18.25222  18.25222       NA  0.4
1000 27.55075 14.47757  14.47757 14.47757  0.4

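Here's a check of the fraction of missing values in each group. Note that the actual percentage of values set to missing can differ from frac when the requested percentage does not correspond to a whole number of rows. For example, here there is no way to select exactly 20% of 8 values: it can only be 12.5% (1 value) or 25% (2 values).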

data %>% group_by(frac) %>% 
    summarise(N=n(), 
      N_missing=sum(is.na(post_mar)), 
      Frac_missing=N_missing/N) 

  frac   N N_missing Frac_missing
1  0.2   8         2    0.2500000
2  0.3 138        41    0.2971014
3  0.4 854       342    0.4004684
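
The rounding effect described above can be reproduced from the group sizes in this output alone. A minimal sketch, where N and frac are copied from the summary table and the names n_missing and check are made up for illustration:

```r
# Group sizes and requested fractions, copied from the summary output above
N    <- c(8, 138, 854)
frac <- c(0.2, 0.3, 0.4)

# Rows actually set to NA: the same round(frac * n) rule as in the mutate() call
n_missing <- round(frac * N)

# Realised fraction per group; it differs from frac whenever frac * N is not an integer
check <- data.frame(frac, N, n_missing, frac_missing = n_missing / N)
print(check)
```

This gives 2, 41 and 342 missing values, the same counts as in the N_missing column.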