2017-07-06

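R: Estimating weighted quantiles by group with assignment

I'm trying to calculate, for each observation within each group, the quantile (0 to 100) of a continuous variable (call it 'value'), and to store that quantile in a new variable.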

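In other words, each row is an observation, and each observation belongs to a group. Every group has more than two observations. Within each group, I need to estimate the distribution of value using the sampling weights in my data, determine which percentile of that distribution each observation falls at, and then add that percentile to the data frame as a column.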
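As a rough illustration of the steps just described, here is a minimal base R sketch. It assumes one simple definition of the weighted percentile (the cumulative share of weight at or below each value, within its group), no missing values, and no ties; the helper name `weighted_pct_rank` and the toy weights are made up for the example, and it does not use the survey package.

```r
# Hypothetical helper: weighted percentile (0 to 100) of each value,
# defined here as the cumulative share of weight up to and including it.
# Assumes no NAs and no tied values.
weighted_pct_rank <- function(value, weight) {
  ord <- order(value)                     # positions sorted by value
  cum_w <- cumsum(weight[ord])            # running total of weights
  pct <- numeric(length(value))
  pct[ord] <- 100 * cum_w / sum(weight)   # write back in original order
  pct
}

# Toy data with fixed weights so the result is easy to check by hand
dat <- data.frame(value  = 1:6,
                  group  = rep(1:3, 2),
                  weight = c(1, 2, 3, 3, 2, 1))

# Fill the new column one group at a time
dat$percentile <- NA_real_
for (g in unique(dat$group)) {
  rows <- dat$group == g
  dat$percentile[rows] <- weighted_pct_rank(dat$value[rows], dat$weight[rows])
}

print(dat)
```

Other definitions of a weighted percentile exist (for example, ones that interpolate between observations), so the numbers depend on the convention chosen.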

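As far as I can tell, the survey package has svyby() and svyquantile(), but the latter returns the values at specified quantiles, not the quantile at which a given observation's value falls.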
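The distinction between the two directions can be shown in base R. This is only a sketch with made-up numbers and a simple step-function definition of the weighted distribution; it is not how svyquantile() computes its estimates.

```r
# Toy data, already sorted by value (stepfun needs increasing knots)
value  <- c(10, 20, 30, 40)
weight <- c(1, 1, 2, 4)

# Cumulative share of weight at each value: 0.125, 0.25, 0.5, 1
cum_prop <- cumsum(weight) / sum(weight)

# Direction 1: quantile -> value.
# The smallest value whose cumulative share reaches 0.5.
value_at_median <- value[which(cum_prop >= 0.5)[1]]

# Direction 2: value -> quantile.
# A weighted empirical CDF built as a right-continuous step function.
weighted_ecdf <- stepfun(value, c(0, cum_prop))
quantile_of_30 <- weighted_ecdf(30)

print(value_at_median)
print(quantile_of_30)
```

The question asks for the second direction, evaluated at every observed value within each group.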

# Load survey package 
library(survey) 

# Set seed for replication 
set.seed(123) 

# Create data with value, group, weight 
dat <- data.frame(value = 1:6,
                  group = rep(1:3, 2),
                  weight = abs(rnorm(6)))

# Declare survey design (weights given as a formula so the column is found in dat)
d <- survey::svydesign(id = ~1, data = dat, weights = ~weight)

# Do something to calculate the quantile and add it to the data 
???? 

This is similar to this question, but that one is not done by group: Compute quantiles incorporating Sample Design (Survey package)

https://stackoverflow.com/questions/32167390/compute-quantiles-incorporating-sample-design-survey-package/32173435#32173435 or https://stackoverflow.com/a/24590340

Sorry, `quantile_by_stype` is by subgroup, isn't it? I'm confused about why using svyby or subsetting to get the subgroup you want isn't enough. Thanks

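@AnthonyDamico Those appear to calculate quantiles by subgroup, but they don't add the values back to the original data once that is done. I ended up using a very hacky approach, which I've added as an answer. If there's a way to speed up the process, I'm happy to revise it! – user3614648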

Answer

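Here is the solution I put together. The sequence of statements inside mutate() below can be modified to turn the sampling weights into whichever quantile is of interest. Although this could be done in base R, I use the dplyr package because dplyr::bind_rows() conveniently fills in NA for missing columns when combining two data frames.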

# Load dplyr for the pipe operator and data-frame verbs
library(dplyr)

# Set seed for replication
set.seed(123)

# Create data with value, group, weight
dat <- data.frame(value = 1:6,
                  group = rep(1:3, 2),
                  weight = abs(rnorm(6)))

# Initialize list for storing group results 
# Setting the length of the list is quicker than 
# creating an empty list and growing it 
quantile_list <- vector("list", length(unique(dat$group))) 

# Initialize variable to indicate initial iteration 
iteration <- 0 

# estimate the decile of each respondent 
# in a large for-loop 

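for(g in unique(dat$group)) {

    # Keep only observations for a given group.
    # The loop variable is named g because filter(group == group)
    # would compare the column with itself and keep every row.
    temp <- dat %>% dplyr::filter(group == g)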

    # Create subset with missing values 
    temp_missing <- temp %>% dplyr::filter(is.na(value)) 

    # Create subset without missing values 
    temp_nonmissing <- temp %>% dplyr::filter(!is.na(value)) 

    # Sort observations with value on value, calculate cumulative 
    # sum of sampling weights, create variable indicating the decile 
    # of responses. 1 = lowest, 10 = highest 
    temp_nonmissing <- temp_nonmissing %>%
        dplyr::arrange(value) %>%
        dplyr::mutate(
            cumulative_weight = cumsum(weight),
            cumulative_weight_prop = cumulative_weight / sum(weight),
            decile = dplyr::case_when(
                cumulative_weight_prop < 0.10 ~ 1,
                cumulative_weight_prop >= 0.10 & cumulative_weight_prop < 0.20 ~ 2,
                cumulative_weight_prop >= 0.20 & cumulative_weight_prop < 0.30 ~ 3,
                cumulative_weight_prop >= 0.30 & cumulative_weight_prop < 0.40 ~ 4,
                cumulative_weight_prop >= 0.40 & cumulative_weight_prop < 0.50 ~ 5,
                cumulative_weight_prop >= 0.50 & cumulative_weight_prop < 0.60 ~ 6,
                cumulative_weight_prop >= 0.60 & cumulative_weight_prop < 0.70 ~ 7,
                cumulative_weight_prop >= 0.70 & cumulative_weight_prop < 0.80 ~ 8,
                cumulative_weight_prop >= 0.80 & cumulative_weight_prop < 0.90 ~ 9,
                cumulative_weight_prop >= 0.90 ~ 10))

    # Increment the iteration of the for loop 
    iteration <- iteration + 1 

    # Join the data with missing values and the data without 
    # missing values on the value variable into 
    # a single data frame 
    quantile_list[[iteration]] <- dplyr::bind_rows(temp_nonmissing, temp_missing) 
    } 

# Convert the list of data frames into a single dataframe 
out <- dplyr::bind_rows(quantile_list) 

# Show outcome 
head(out)
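The answer mentions that the same thing could be done in base R. As a hedged sketch of what that might look like, the version below uses ave() over row indices so that the original row order is preserved, and findInterval() in place of the chain of case_when() conditions. The helper name `weighted_decile` and the toy data (fixed weights, one missing value) are assumptions made for the example; rows with a missing value get NA for the decile.

```r
# Hypothetical helper: decile (1 = lowest, 10 = highest) from the
# cumulative share of weight, computed over non-missing values only.
weighted_decile <- function(value, weight) {
  decile <- rep(NA_real_, length(value))
  ok <- which(!is.na(value))               # positions with a usable value
  ord <- ok[order(value[ok])]              # those positions sorted by value
  cum_prop <- cumsum(weight[ord]) / sum(weight[ord])
  # Count how many of the cut points 0.1, ..., 0.9 are <= cum_prop
  decile[ord] <- findInterval(cum_prop, (1:9) / 10) + 1
  decile
}

# Toy data with fixed weights and one missing value
dat <- data.frame(value  = c(1, 2, 3, 4, 5, NA),
                  group  = rep(1:3, 2),
                  weight = c(1, 2, 3, 3, 2, 1))

# ave() splits the row indices by group, applies the helper to each
# piece, and puts the results back in the original row order.
dat$decile <- ave(seq_len(nrow(dat)), dat$group,
                  FUN = function(i) weighted_decile(dat$value[i], dat$weight[i]))

print(dat)
```

Because nothing is sorted or re-bound at the data-frame level, there is no list of per-group results to pre-allocate and combine afterwards.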