計算兩個數據集的置信區間

我想計算置信區間（99％的水平，而不是95）爲可變兩歲dataframes，infert_control和infert_patient的平均值，其中：

infert_control = subset(infert$age, infert$case == 0) 
infert_patient = subset(infert$age, infert$case == 1)

infert是一個內置的R數據集，對於那些不熟悉它的人，這裏是：病例0命名爲對照組患者，病例1實際的。

> infert 
    education age parity induced case spontaneous stratum pooled.stratum 
1  0-5yrs 26  6  1 1   2  1    3 
2  0-5yrs 42  1  1 1   0  2    1 
3  0-5yrs 39  6  2 1   0  3    4 
4  0-5yrs 34  4  2 1   0  4    2 
5  6-11yrs 35  3  1 1   1  5    32 
6  6-11yrs 36  4  2 1   1  6    36 
7  6-11yrs 23  1  0 1   0  7    6 
8  6-11yrs 32  2  0 1   0  8    22 
9  6-11yrs 21  1  0 1   1  9    5 
10 6-11yrs 28  2  0 1   0  10    19 
11 6-11yrs 29  2  1 1   0  11    20 
... 
239 12+ yrs 38  6  0 0   2  74    63 
240 12+ yrs 26  2  1 0   1  75    49 
241 12+ yrs 31  1  1 0   0  76    45 
242 12+ yrs 31  2  0 0   1  77    53 
243 12+ yrs 25  1  0 0   1  78    41 
244 12+ yrs 31  1  0 0   1  79    45 
245 12+ yrs 34  1  0 0   0  80    47 
246 12+ yrs 35  2  2 0   0  81    54 
247 12+ yrs 29  1  0 0   1  82    43 
248 12+ yrs 23  1  0 0   1  83    40

什麼是解決此問題的正確方法？

我已經計算age列的兩個infert_control和infert_patient，加上每個子集的標準偏差的平均值。

來源

2014-01-19 Dimitris Sfounis

這是一個統計問題。但你可以報告正確的'分位數' – rawr

也許't.test'可以幫到你嗎？ – Fernando

@Fernando由於本練習的第二部分需要't.test（）'，所以我認爲我不應該在這部分中使用它，這是第一部分。這整個事情是一個介紹到R的任務。 –

你可以很容易地計算手動置信區間：

infert_control <- subset(infert$age, infert$case == 0) 

# calculate needed values 
m <- mean(infert_control) 
s <- sd(infert_control) 
n <- length(infert_control) 

# calculate error for normal distribution (choose you distribution here, e.g. qt for t-distribution) 
a <- 0.995 # 99% CI => 0.5% on both sides 
error <- qnorm(a)*s/sqrt(n) 

# calculate CI 
ci_lower <- m-error 
ci_upper <- m+error

參見http://en.wikipedia.org/wiki/Confidence_interval（對不起，維基百科的鏈接，但它有一個很好的解釋和說明你的公式）

來源

2014-01-19 18:19:11 sgibb

非常感謝您的幫助。讓我研究你的答案。 –

...或小功能：

cifun <- function(data, ALPHA){ 
    c(mean(data) - qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)), 
    mean(data) + qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data))) 
} 

cifun(infert_control, 0.01)

來源

2014-01-19 18:27:02 holzben

您可以使用引導這個：

library(boot) 
set.seed(42) 
boot_mean <- boot(infert_control, function(x, i) mean(x[i]), R=1e4) 
quantile(boot_mean$t, probs=c(0.005, 0.995)) 
#  0.5% 99.5% 
# 30.47273 32.58182

或者，如果你不想使用圖書館：

set.seed(42) 
R <- 1e4 
boot_mean <- colMeans(
       matrix(
        sample(infert_control, R * length(infert_control), TRUE), 
        ncol=R)) 
quantile(boot_mean, probs=c(0.005, 0.995)) 
# 0.5% 99.5% 
#30.42424 32.55152

來源

2014-01-19 18:33:52 Roland

我很抱歉，但我不熟悉bootstrap庫。 –

那麼，請閱讀文檔或手動執行（請參閱我的編輯）。 – Roland

所以很多答案...

隨機抽樣的平均值具有t分佈，不正常，儘管t -> N爲df -> Inf。

cl <- function(data,p) { 
    n <- length(data) 
    cl <- qt(p/2,n-1,lower.tail=F)*sd(data)/sqrt(n) 
    m <- mean(data) 
    return(c(lower=m-cl,upper=m+cl)) 
} 
cl.control <- cl(infert_control,0.01) 
cl.control 
# lower upper 
# 30.42493 32.55689 

cl.patient <- cl(infert_patient,0.01) 
cl.patient 
# lower upper 
# 30.00221 33.05803 

aggregate(age~case,data=infert,cl,p=0.01) # much better way... 
# case age.lower age.upper 
# 1 0 30.42493 32.55689 
# 2 1 30.00221 33.05803

此外，分位數函數（e.q. qt(...)和qnorm(...)）默認返回下尾，所以你的限制將被逆轉，除非你設置lower.tail=F

來源

2014-01-19 18:45:57 jlhoward

非常感謝@jhoward。看起來這將幫助我完成練習的第二部分。 –

計算兩個數據集的置信區間

回答

相關問題