2012-02-24 62 views
0

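R: Bayesian logistic regression for hierarchical data

This is a cross-post from stats.stackexchange, where I did not get a satisfactory answer. I have two datasets: the first describes schools, and the second lists the students in each school who failed a standardized test (emphasis intentional). A fake dataset can be generated as follows (thanks to Tharen):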

#random school data for 30 schools 
schools.num = 30 
schools.data = data.frame(school_id=seq(1,schools.num) 
         ,tot_white=sample(100:300,schools.num,TRUE) 
         ,tot_black=sample(100:300,schools.num,TRUE) 
         ,tot_asian=sample(100:300,schools.num,TRUE) 
         ,school_rev=sample(4e6:6e6,schools.num,TRUE) 
         ) 

#total students in each school 
schools.data$tot_students = schools.data$tot_white + schools.data$tot_black + schools.data$tot_asian 
#sum of all students all schools 
tot_students = sum(schools.data$tot_white, schools.data$tot_black, schools.data$tot_asian) 
#generate some random failing students 
fail.num = as.integer(tot_students * 0.05) 

students = data.frame(student_id=sample(seq_len(tot_students), fail.num, FALSE) 
         ,school_id=sample(1:schools.num, fail.num, TRUE) 
         ,race=sample(c('white', 'black', 'asian'), fail.num, TRUE) 
        ) 

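I want to estimate P(Fail = 1 | student race, school revenue). If I run a multinomial discrete choice model on the students dataset, I will clearly be estimating P(Race | Fail = 1) instead, so I have to invert it. Since all the pieces of information are available in the two datasets (P(Fail), P(Race), revenue), I see no reason why this cannot be done, but I am stuck on how to implement it in R. Any pointers would be much appreciated. Thanks.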
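The inversion described in the question is Bayes' rule, P(Fail = 1 | Race) = P(Race | Fail = 1) · P(Fail = 1) / P(Race). The following is a minimal base-R sketch of that arithmetic on toy data shaped like the question's; the seed and the simulated values are assumptions, and school revenue is ignored here, so this only illustrates the marginal inversion by race, not the full conditional model.

```r
# Toy data in the same shape as the question's (all values are simulated)
set.seed(1)
schools.num <- 30
schools.data <- data.frame(
  school_id = seq_len(schools.num),
  tot_white = sample(100:300, schools.num, TRUE),
  tot_black = sample(100:300, schools.num, TRUE),
  tot_asian = sample(100:300, schools.num, TRUE)
)
tot_by_race <- c(
  asian = sum(schools.data$tot_asian),
  black = sum(schools.data$tot_black),
  white = sum(schools.data$tot_white)
)
tot_students <- sum(tot_by_race)
fail.num <- as.integer(tot_students * 0.05)
students <- data.frame(
  school_id = sample(seq_len(schools.num), fail.num, TRUE),
  race = sample(c("white", "black", "asian"), fail.num, TRUE)
)

# Ingredients of Bayes' rule
p_fail <- fail.num / tot_students                      # P(Fail = 1)
p_race <- tot_by_race / tot_students                   # P(Race)
fail_by_race <- table(students$race)[names(p_race)]    # failures per race
p_race_given_fail <- as.numeric(fail_by_race) / fail.num  # P(Race | Fail = 1)

# P(Fail = 1 | Race) = P(Race | Fail = 1) * P(Fail = 1) / P(Race)
p_fail_given_race <- p_race_given_fail * p_fail / p_race

# Direct check: failures per race divided by students per race
direct <- as.numeric(fail_by_race) / tot_by_race
print(rbind(bayes = p_fail_given_race, direct = direct))
```

Both rows agree because the inversion reduces algebraically to failures per race divided by students per race, which is why aggregating the two datasets into counts is enough for a logistic regression.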

Answers

1

It would be easier if you had a single data frame.

library(reshape2) 
library(plyr) 
d1 <- ddply(
    students, 
    c("school_id", "race"), 
    summarize, 
    fail=length(student_id) 
) 
d2 <- with(schools.data, data.frame( 
    school_id = school_id, 
    white = tot_white, 
    black = tot_black, 
    asian = tot_asian, 
    school_rev = school_rev 
)) 
d2 <- melt(d2, 
    id.vars=c("school_id", "school_rev"), 
    variable.name="race", 
    value.name="total" 
) 
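# Keep school/race cells with no failing students (they are absent from d1) 
d <- merge(d1, d2, by=c("school_id", "race"), all.y=TRUE) 
d$fail[is.na(d$fail)] <- 0 
d$pass <- d$total - d$fail 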

Then you can look at the data

library(lattice) 
xyplot(d$fail/d$total ~ school_rev | race, data=d) 

or compute whatever you want.

r <- glm(
    cbind(fail,pass) ~ race + school_rev, 
    data=d, 
    family=binomial() # Logistic regression (not bayesian) 
) 
summary(r) 
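A binomial model of this kind yields the quantity asked for, P(Fail = 1 | race, school revenue), through `predict(..., type = "response")`. The sketch below is self-contained and uses simulated aggregated data rather than the `d` built above; the seed, the 5% failure rate, and the `rev_m` column (revenue in millions) are assumptions made for illustration.

```r
set.seed(42)
# Hypothetical aggregated data: one row per school and race
n_schools <- 30
d <- expand.grid(school_id = seq_len(n_schools),
                 race = c("asian", "black", "white"),
                 stringsAsFactors = TRUE)
school_rev <- sample(4e6:6e6, n_schools, TRUE)
d$school_rev <- school_rev[d$school_id]
d$total <- sample(100:300, nrow(d), TRUE)
d$fail <- rbinom(nrow(d), size = d$total, prob = 0.05)
d$pass <- d$total - d$fail

# Express revenue in millions so the coefficient is on a readable scale
d$rev_m <- d$school_rev / 1e6

fit <- glm(cbind(fail, pass) ~ race + rev_m, data = d, family = binomial())

# Predicted P(Fail = 1 | race, revenue) on a small grid of covariate values
newdata <- expand.grid(race = levels(d$race), rev_m = c(4, 5, 6))
newdata$p_fail <- predict(fit, newdata = newdata, type = "response")
print(newdata)
```

Because the simulated failure rate does not depend on race or revenue, the predicted probabilities should all sit near 0.05; with real data the grid shows how the estimated failure probability varies across groups.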

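(Edit) If you have more information about the students who failed but only aggregated data for those who passed, you can recreate a complete dataset as follows.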

# Unique student_id for the passed students 
d3 <- ddply(d, 
    c("school_id", "race"), 
    summarize, student_id=seq_len(pass) # seq_len() is safe when pass is 0 
) 
d3$student_id <- - seq_len(nrow(d3)) 
# All students 
d3$result <- "pass" 
students$result <- "fail" 
d3 <- merge(# rather than rbind, in case there are more columns 
    d3, students, 
    by=c("student_id", "school_id", "race", "result"), 
    all=TRUE 
) 
# Students and schools in a single data.frame 
d3 <- merge(d3, schools.data, by="school_id", all=TRUE) 
# Check that the results did not change 
r <- glm(
    (result=="fail") ~ race + school_rev, 
    data=d3, 
    family=binomial() 
) 
summary(r) 
+0

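Vincent, thanks for this. The problem with aggregating to the school level is that I cannot include additional student-level characteristics, such as parental income. That is why I want an explicitly hierarchical method for estimating the inverse probability. – user702432 2012-02-24 08:13:57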

+0

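In that case, I would still suggest putting everything in the same data.frame (with school_id, student_id, race, result, school_rev, etc.), but you also need rows for the students who passed the test. – 2012-02-24 08:24:34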

+0

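That is exactly the problem. I have a truncated sample at the student level, which is why I was thinking of something along the lines of mixed modelling. – user702432 2012-02-24 08:28:38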

0

You need a dataset with information on all students, both those who failed and those who passed.

schools.num = 30 
schools.data = data.frame(school_id=seq(1,schools.num) 
          ,tot_white=sample(100:300,schools.num,TRUE) 
          ,tot_black=sample(100:300,schools.num,TRUE) 
          ,tot_asian=sample(100:300,schools.num,TRUE) 
          ,school_rev=sample(4e6:6e6,schools.num,TRUE) 
         ) 

library(plyr) 
fail_ratio <- 0.05 
dataset <- ddply(schools.data, .(school_id, school_rev), function(x){ 
    n <- sum(x$tot_white, x$tot_asian, x$tot_black) 
    data.frame(
        Fail = rbinom(n, size = 1, prob = fail_ratio), 
        Race = c(rep("white", x$tot_white), rep("asian", x$tot_asian), rep("black", x$tot_black)) 
    ) 
}) 
dataset$Race <- factor(dataset$Race) 

Then you can use glmer() from the lme4 package for a frequentist approach.

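library(lme4) 
# school_rev is in the millions; scale() standardizes it to avoid warnings about predictors on very different scales 
glmer(Fail ~ scale(school_rev) + Race + (1|school_id), data = dataset, family = binomial) 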

If you need Bayesian estimates, have a look at the MCMCglmm package.
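To show what a Bayesian fit involves without installing any package, here is a rough base-R sketch of a random-walk Metropolis sampler for a logistic regression on aggregated counts. This is not how MCMCglmm works internally, and it leaves out the school-level random intercept, so it is a plain (non-hierarchical) Bayesian logistic regression. The simulated data, the Normal(0, 10) priors, the centred revenue term, and the tuning constants are all assumptions chosen for illustration.

```r
set.seed(123)
# Hypothetical aggregated data: one row per school and race
n_schools <- 30
d <- expand.grid(school_id = seq_len(n_schools),
                 race = c("asian", "black", "white"))
rev_m <- runif(n_schools, 4, 6)          # school revenue in millions
d$rev_m <- rev_m[d$school_id]
d$total <- sample(100:300, nrow(d), TRUE)
d$fail <- rbinom(nrow(d), d$total, 0.05)

# Design matrix; revenue is centred at 5 million
X <- model.matrix(~ race + I(rev_m - 5), data = d)

# Log posterior: binomial likelihood plus independent Normal(0, 10) priors
log_post <- function(beta) {
  eta <- as.vector(X %*% beta)
  sum(dbinom(d$fail, d$total, plogis(eta), log = TRUE)) +
    sum(dnorm(beta, 0, 10, log = TRUE))
}

# Start at the maximum-likelihood fit and shape the proposal from its covariance
fit <- glm(cbind(fail, total - fail) ~ race + I(rev_m - 5),
           data = d, family = binomial())
beta <- coef(fit)
L <- t(chol(vcov(fit)))                  # lower-triangular factor, L %*% t(L) = vcov
step <- 2.38 / sqrt(length(beta))        # common random-walk scaling heuristic

n_iter <- 5000
draws <- matrix(NA_real_, n_iter, length(beta),
                dimnames = list(NULL, names(beta)))
lp <- log_post(beta)
accepted <- 0
for (i in seq_len(n_iter)) {
  prop <- beta + step * as.vector(L %*% rnorm(length(beta)))
  lp_prop <- log_post(prop)
  # Metropolis acceptance step (the proposal is symmetric)
  if (log(runif(1)) < lp_prop - lp) {
    beta <- prop
    lp <- lp_prop
    accepted <- accepted + 1
  }
  draws[i, ] <- beta
}

post <- draws[-seq_len(1000), ]          # drop the first 1000 draws as burn-in
print(colMeans(post))                    # posterior means
print(apply(post, 2, quantile, c(0.025, 0.975)))  # 95% credible intervals
```

With priors this weak, the posterior means should land close to the `glm` estimates; a dedicated package is still the better choice once random effects for schools are needed.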