2017-07-15 70 views

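R: computing the p-value of random distributions

I want to obtain a p-value for two randomly distributed observations x and y, for example: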

> set.seed(0) 
> x <- rnorm(1000, 3, 2) 
> y <- rnorm(2000, 4, 3) 

or:

> set.seed(0) 
> x <- rexp(50, 10) 
> y <- rexp(100, 11) 

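Suppose T is my test statistic, defined as mean(x) - mean(y), and H0 is mean(x) - mean(y) = 0. The p-value is then defined as: p-value = P[T > T_observed | H0 holds].
I tried to do it like this: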

> z <- c(x,y) # if H0 holds then x and y are distributed with the same distribution 
> f <- function(x) ecdf(z) # this will get the distribution of z (x and y) 

Then, to compute the p-value, I tried this:

> T <- replicate(10000, mean(sample(z,1000,TRUE))-mean(sample(z,2000,TRUE))) # this is supposed to get the null distribution of mean(x) - mean(y) 
> f(quantile(T,0.05)) # calculating the p-value for a significance of 5% 

Evidently this does not work. What am I missing?

Answer


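Your intent is very good: computing statistical significance via bootstrap sampling (aka bootstrapping). However, mean(sample(z,1000,TRUE)) - mean(sample(z,2000,TRUE)) will not work, because it takes the mean of 1000 samples of z minus the mean of 2000 samples of z. Whatever the true means of x and y are, this will certainly be very close to 0.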
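To make the point concrete, here is a minimal sketch (not the asker's exact code) showing that differences of means resampled from the pooled vector z cluster around 0 regardless of the true group means. The names `t_obs`, `t_null`, and `p_value` and the choice of 2,000 replicates are illustrative assumptions. The last step is one possible way to read such a distribution: it compares the observed statistic with the resampled values, using a two-sided comparison, which is also an assumption since the question states a one-sided definition.

```r
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)
z <- c(x, y)                       # pooled sample, as in the question

t_obs <- mean(x) - mean(y)         # observed statistic (illustrative name)

# Draw both groups from the pooled data, which is what H0 assumes
t_null <- replicate(2000, mean(sample(z, length(x), TRUE)) -
                          mean(sample(z, length(y), TRUE)))

print(mean(t_null))                # close to 0 whatever the true means are

# Two-sided comparison of the observed value against the resampled ones
# (assumption: the question itself defines a one-sided P[T > T_observed | H0])
p_value <- mean(abs(t_null) >= abs(t_obs))
print(p_value)
```

The resampled differences only describe what H0 would produce; they say nothing about x and y individually until the observed difference is set against them.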

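I suggest the following: draw 2000 samples (with replacement) from each of x and y and compute their differences:

diff <- (sample(x, size = 2000, replace = TRUE) - sample(y, size = 2000, replace = TRUE)) 

Of course, you can also increase the number of replicates, as you suggested, to increase confidence. I prefer confidence intervals (CI) to p-values because I find them more informative (with statistical accuracy equivalent to that of the p-value). The CI can then be computed from the mean and the standard error as follows: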

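# standard error of the mean difference (as written, length(x) is used as the sample size)
stderror <- sd(diff)/sqrt(length(x)) 
# mean +/- one standard error; scale stderror by qnorm(0.975) for an approximate 95% interval
upperCI <- mean(diff)+stderror 
lowerCI <- mean(diff)-stderror 
cat(lowerCI, upperCI) 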

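Since the CI does not include 0, the null hypothesis is rejected. Note that the result will be close to the CI from a t-test (for your normal example) in R: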

t <- t.test(x, y) 
cat(t$conf.int)
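As a rough cross-check of the comparison with the t-test, the sketch below uses a percentile bootstrap, which is a swapped-in variant and not the standard-error calculation shown earlier in this answer. It resamples x and y separately, each at its own size, and takes the 2.5% and 97.5% quantiles of the resampled mean differences. The names `boot_diff`, `boot_ci`, and `t_ci` and the 2,000 replicates are assumptions made for illustration.

```r
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)

# Resample each group separately (sample() keeps the original size by default)
# and record the difference in means for every replicate.
boot_diff <- replicate(2000, mean(sample(x, replace = TRUE)) -
                             mean(sample(y, replace = TRUE)))

# Percentile 95% interval from the bootstrap distribution
boot_ci <- quantile(boot_diff, c(0.025, 0.975))

# Welch t-test interval for comparison
t_ci <- t.test(x, y)$conf.int

print(boot_ci)
print(t_ci)
```

For the normal example the two intervals should land close to each other, and both exclude 0.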