2014-01-05 65 views

Conditionally changing data frame column(s) based on the values of another column in simulated data

n <- 50
set.seed(378)
df <- data.frame(
    age = sample(20:90, n, replace = TRUE),
    sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
    smoker = sample(c("never", "former", "active"), n, replace = TRUE, prob = c(0.4, 0.45, 0.15)),
    py = abs(rnorm(n, 25, 10)),
    yrsquit = abs(rnorm(n, 10, 2)),
    outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

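I need to introduce some imbalance between the outcome groups (1 = disease, 0 = no disease). For example, subjects with the disease should be older and more likely to be male. I tried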

df1 <- within(df, sapply(length(outcome), function(x) { 
if (outcome[x] == 1) { 
    age[x] <- age[x] + 15 
    sex[x] <- sample(c("m","f"), prob=c(0.8,0.2)) 
} 
})) 

but it seems to have no effect, as these comparisons show:

tapply(df$sex, df$outcome, length) 
tapply(df1$sex, df$outcome, length) 
tapply(df$age, df$outcome, mean) 
tapply(df1$age, df$outcome, mean) 

Answer


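Using sapply inside within does not work the way you expect, which is why nothing changes. within only keeps assignments made directly in the expression it evaluates. In your code the assignments to age[x] and sex[x] happen inside the anonymous function passed to sapply, so they modify local copies that are discarded when the function returns, and the return value of sapply is never assigned to anything. As a result, within does not modify the data frame. Also note that sapply(length(outcome), ...) calls the function once, with x equal to the number of rows, rather than once per row.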
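
To make the scoping issue concrete, here is a minimal sketch on a hypothetical three-row data frame (toy, lost, kept, and visited are illustrative names, not from the question). It contrasts an assignment made inside a function called from within with one made directly in the within expression, and it also shows that sapply(length(x), ...) visits a single index.

```r
# Hypothetical toy data frame, not the question's simulated data.
toy <- data.frame(x = 1:3)

# Assignment inside a function called from within(): the function changes
# its own local copy of x, so the change is lost.
lost <- within(toy, sapply(seq_along(x), function(i) x[i] <- x[i] * 10))

# Assignment made directly in the within() expression: the change is kept.
kept <- within(toy, x <- x * 10)

# length(x) is a single number, so sapply() calls the function only once.
visited <- sapply(length(toy$x), function(i) i)

print(lost$x)   # same values as toy$x
print(kept$x)   # every value multiplied by 10
print(visited)  # only the last index
```

The vectorised form in kept is usually preferable to a loop here, since the whole column is updated in one assignment that within can see.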

Here is a simpler way to modify the data frame, without a loop or sapply:

idx <- df$outcome == "1"
df1 <- within(df, {
    age[idx] <- age[idx] + 15
    sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))
})

Now the data frames differ:

> tapply(df$age, df$outcome, mean) 
       0        1
60.46341 57.55556
> tapply(df1$age, df$outcome, mean) 
       0        1
60.46341 72.55556

> tapply(df$sex, df$outcome, summary) 
$`0` 
 f  m
24 17

$`1` 
f m 
2 7 

> tapply(df1$sex, df$outcome, summary) 
$`0` 
 f  m
24 17

$`1` 
f m 
1 8
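
The same change can also be sketched with plain indexed assignment, without within, followed by a compact per-group check. This is only an illustration under some assumptions: df2 is a hypothetical name, and the data frame is a trimmed-down version of the question's simulation (fewer columns), so its random draws will not match the output shown above.

```r
set.seed(378)
n <- 50

# Trimmed-down version of the question's simulated data (assumption:
# only the columns needed for this illustration are kept).
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Logical index of the subjects with the disease.
idx <- df$outcome == "1"

# df2 is a hypothetical name for the modified copy.
df2 <- df
df2$age[idx] <- df2$age[idx] + 15
df2$sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))

# Mean age per outcome group, before and after the change.
print(aggregate(age ~ outcome, data = df, FUN = mean))
print(aggregate(age ~ outcome, data = df2, FUN = mean))

# Share of each sex within each outcome group after the change.
print(prop.table(table(df2$outcome, df2$sex), margin = 1))
```

Proportions from prop.table are easier to compare across groups of different size than the raw counts, which matters here because the disease group is much smaller.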