2016-10-06 159 views
2

Calculate the percentage/frequency of a value in a survey object in R

Maybe the answer to my question is trivial, but I have not found the right one.

I have a national survey made up of many variables, like this (for the sake of simplicity, I have omitted some variables):

year  id  y.b   sex  income  married  pens  weight
2002  1   1950  F    100000  1        0     1.12
2002  2   1943  M    55000   1        1     0.55
2004  1   1950  F    88000   1        1     1.1
2004  2   1943  M    66000   1        1     0.6
2006  3   1966  M    12000   0        1     0.23
2008  3   1966  M    24000   0        1     0.23
2008  4   1972  F    33000   1        0     0.66
2010  4   1972  F    35000   1        0     0.67

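where id is the person interviewed, y.b is the year of birth, married is a dummy (1 married, 0 single), pens is a dummy that takes the value 1 if the person invests in a supplementary pension scheme, and weight is the survey weight.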

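Note that the original survey has 40,000 observations from 2002 to 2014 (I filtered it so that only individuals who appear more than once are kept). I used this command to create a survey object: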

d.s <- svydesign(ids=~1, data=df, weights=~weight) 

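Now that df is weighted, I would like to find, for example, the percentage of women, or of married people, who invest in a supplementary pension. I read the R help and searched online for a command to obtain the percentage, but I did not find the right one.

Thank you in advance.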

+0

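So the proportion is the number of women who invest in a supplementary pension / the total number of women, right? And likewise for married people. What code do you have so far? – blacksite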

+1

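Right @not_a_robot. I used **svytable(~woman + obs, d.s)**, where obs is the total number of observations (I created a variable obs holding a sequence from 1 to the last row number); I also tried **svymean(~woman, d.s)** and **svyratio(~donna, ~obs, d.s)**, but I did not get what I needed. –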

Answers

2
# same setup 
library(survey) 

df <- data.frame(sex = c('F', 'M', 'F', 'M', 'M', 'M', 'F', 'F'), 
       married = c(1,1,1,1,0,0,1,1), 
       pens = c(0, 1, 1, 1, 1, 1, 0, 0), 
       weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67)) 

d.s <- svydesign(ids=~1, data=df, weights=~weight) 

# subset to women only then calculate the share with a pension 
svymean(~ pens , subset(d.s , sex == 'F')) 
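
The same design-based approach can be extended to married respondents and to confidence intervals. What follows is a minimal sketch on the sample data above, rebuilt here so the block runs on its own; the object names `share_women`, `share_married` and `share_by_sex` are only illustrative and do not come from the original answer.

```r
library(survey)

df <- data.frame(sex = c('F', 'M', 'F', 'M', 'M', 'M', 'F', 'F'),
                 married = c(1, 1, 1, 1, 0, 0, 1, 1),
                 pens = c(0, 1, 1, 1, 1, 1, 0, 0),
                 weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))

d.s <- svydesign(ids = ~1, data = df, weights = ~weight)

# weighted share of women with a pension, plus a 95% confidence interval
share_women <- svymean(~pens, subset(d.s, sex == 'F'))
confint(share_women)

# weighted share of married respondents with a pension
share_married <- svymean(~pens, subset(d.s, married == 1))

# weighted share with a pension for each sex in a single call
share_by_sex <- svyby(~pens, ~sex, d.s, svymean)
```

Because the estimates stay inside the survey object, standard errors come with them; with only eight rows the intervals are of course very wide.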
+0

Thank you for your answer. It is actually much easier! –

+0

And it is actually correct. – StasK

0

I don't know exactly what you want weight to do, but here is a very simple dplyr solution for the proportion of women with a pension:

library(survey)
library(dplyr)

df <- data.frame(sex = c('F', 'M', 'F', 'M', 'M', 'M', 'F', 'F'), 
       married = c(1,1,1,1,0,0,1,1), 
       pens = c(0, 1, 1, 1, 1, 1, 0, 0), 
       weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67)) 

d.s <- svydesign(ids=~1, data=df, weights=~weight) 

# data frame of women with a pension 
women_with_pension <- d.s$variables %>% 
    filter(sex == 'F' & pens == 1) 

# number of rows (i.e. number of women with a pension) in that df 
n_women_with_pension <- nrow(women_with_pension) 

# data frame of all women 
all_women <- d.s$variables %>% 
    filter(sex == 'F') 

# number of rows (i.e. number of women) in that df 
n_women <- nrow(all_women) 

# divide the number of women with a pension by the total number of women 
proportion_women_with_pension <- n_women_with_pension/n_women 

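This gives you the basic (unweighted) proportion of women with a pension. Apply the same logic to get the proportion of married people with a pension.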
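
Applied to married respondents, that logic could look like the sketch below. It is unweighted, rebuilds the sample `df` so that it runs on its own, and the name `married_share` is only illustrative.

```r
library(dplyr)

df <- data.frame(sex = c('F', 'M', 'F', 'M', 'M', 'M', 'F', 'F'),
                 married = c(1, 1, 1, 1, 0, 0, 1, 1),
                 pens = c(0, 1, 1, 1, 1, 1, 0, 0),
                 weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))

# unweighted share of married respondents who have a pension:
# the mean of a 0/1 dummy is the proportion of ones
married_share <- df %>%
    filter(married == 1) %>%
    summarise(share = mean(pens)) %>%
    pull(share)
```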

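As for the weight variable, are you trying to compute some kind of weighted proportion? In that case, you would sum the weight values of the women in each class (women with a pension, and all women), like this: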

# total survey weight of women with a pension (one-row data frame)
women_with_pension <- d.s$variables %>%
    filter(sex == 'F' & pens == 1) %>%
    summarise(total_weight = sum(weight))

# extract that total as a plain number
women_with_pension_weight <- women_with_pension[[1]]

# total survey weight of all women (one-row data frame)
all_women <- d.s$variables %>%
    filter(sex == 'F') %>%
    summarise(total_weight = sum(weight))

# extract that total as a plain number
all_women_weight <- all_women[[1]]

# divide the total weight of women with a pension by the total weight of all women
# 0.3098592 for this sample data
prop_weight_women_with_pension <- women_with_pension_weight/all_women_weight
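
Written out, the quantity computed above is a ratio of weight sums, where $w_i$ is the survey weight of respondent $i$, $\mathrm{pens}_i$ is the 0/1 pension dummy, and $F$ denotes the women in the sample (the symbols are introduced here only for illustration):

```latex
\hat{p}
  = \frac{\sum_{i \in F} w_i \,\mathrm{pens}_i}{\sum_{i \in F} w_i}
  = \frac{1.10}{1.12 + 1.10 + 0.66 + 0.67}
  = \frac{1.10}{3.55}
  \approx 0.3099
```

This is the same point estimate that `svymean(~pens, subset(d.s, sex == 'F'))` reports; the survey function additionally returns a standard error.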
+1

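Thank you, your answer is what I was looking for. I want to use the weights to get a correct representation of the sample (since the survey was carried out on a sample, using the survey weights should represent the whole population better). –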

+1

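@LauraR. I downvoted because this strategy of breaking into the survey object is nonsensical, and it does not let the user compute confidence intervals. See my answer. –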