R - Leave-one-out aggregation over a grouping variable (NAs present)

Starting from the data set "have", I would like to compute new variables in R, defined as follows:

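RE: the mean of the "R" values within a given value of the "CAT" variable, excluding the observation itself (note: missing data are present, and when R is missing I want RE to be the group mean of R). IE: like RE, the mean of the "I" responses within a given "CAT" value, excluding the observation itself (same handling of missing data).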

An example data set and the desired output are given below.

Have:

ID CAT R I … (Additional variables I need to retain) 
1 1 1 3 … 
2 1 2 NA … 
3 1 1 1 … 
4 2 NA 3 … 
5 2 4 5 … 
6 2 4 NA …

The desired data set ("want") should be:

ID CAT R I RE IE  … (Additional variables retained) 
1 1 1 3 1.5 1  … 
2 1 2 NA 1 2  … 
3 1 1 1 1.5 3  … 
4 2 NA 3 ... ... … 
5 2 4 5    … 
6 2 4 NA    …
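As a quick hand check of the definition, the leave-one-out means for two rows of the first CAT group can be reproduced directly. This is only a sketch using the example values; the vector names `r_cat1` and `i_cat1` are illustrative and not part of the original data set.

```r
# Values of R and I for CAT == 1 (IDs 1, 2, 3), taken from the example
r_cat1 <- c(1, 2, 1)
i_cat1 <- c(3, NA, 1)

# RE for ID 1: mean of the other R values in the group, i.e. mean(2, 1)
re_id1 <- mean(r_cat1[-1], na.rm = TRUE)

# IE for ID 2: I is missing for this row, so dropping it leaves the
# group mean of the observed I values, i.e. mean(3, 1)
ie_id2 <- mean(i_cat1[-2], na.rm = TRUE)

re_id1  # 1.5
ie_id2  # 2
```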

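Note that the SQL-based solution below produces the desired output in SAS, but I cannot get it to work in R (using the sqldf package). One problem I am aware of is that the missing() function is SAS-specific (it is not available in generic SQL). All of this may still be a useful starting point for an SQL solution using the sqldf package: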

proc sql; 
create table want as 
select *, 
    (sum(R)-coalesce(R, 0))/(count(R)-1+missing(R)) as RE, 
    (sum(I)-coalesce(I, 0))/(count(I)-1+missing(I)) as IE 
from have 
group by CAT 
order by ID, CAT; 
quit;
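One way the query above might be ported to sqldf is sketched below, assuming the default SQLite backend. Two things differ from SAS: SQLite does not remerge group statistics onto the detail rows, so the sums and counts are computed in a subquery and joined back, and `missing(R)` is replaced by `(R IS NULL)`, which SQLite evaluates to 0 or 1. The `* 1.0` is there because the columns are integers and SQLite would otherwise use integer division. The aliases `h`, `g`, `sum_r`, `n_r`, `sum_i` and `n_i` are illustrative names, not part of the original.

```r
library(sqldf)

have <- read.table(header = TRUE, text =
"ID CAT R I
1 1 1 3
2 1 2 NA
3 1 1 1
4 2 NA 3
5 2 4 5
6 2 4 NA")

# Group totals are computed once per CAT and joined back to every row;
# SUM() and COUNT(col) both ignore NULLs, like their SAS counterparts.
want <- sqldf("
  SELECT h.*,
         (g.sum_r - COALESCE(h.R, 0)) * 1.0 / (g.n_r - 1 + (h.R IS NULL)) AS RE,
         (g.sum_i - COALESCE(h.I, 0)) * 1.0 / (g.n_i - 1 + (h.I IS NULL)) AS IE
  FROM have AS h
  JOIN (SELECT CAT,
               SUM(R) AS sum_r, COUNT(R) AS n_r,
               SUM(I) AS sum_i, COUNT(I) AS n_i
        FROM have
        GROUP BY CAT) AS g
    ON h.CAT = g.CAT
  ORDER BY h.ID
")
```

A group whose only non-missing value belongs to the current row leads to a division by zero, which SQLite returns as NULL (NA in R).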

Thank you very much for your help.

Source

2017-07-27 Justin

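With dplyr you can apply a function to subsets of rows without affecting the other rows, somewhat like a "window" in SQL, if you are familiar with that concept.

Create a function that performs the required calculation for a single group. Use group_by() to group the rows, then pass the result to mutate() and call the custom function there. On grouped data, the function is applied to one group at a time, which gives the desired result.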

library(dplyr) 

# Data from example 
have <- read.table(header = TRUE, text = 
"ID CAT R I 
1 1 1 3 
2 1 2 NA 
3 1 1 1 
4 2 NA 3 
5 2 4 5 
6 2 4 NA") 

# Create a leave-one-out mean function -- for a single ID group 

leave_one_out_mean <- function(x) { 
    result <- c() 

    for (i in seq_along(x)) { 
     # note minus-i subsetting is used to subset one observation in each iteration 
     # and the na.rm option to handle missing values 
     result[i] <- mean(x[-i], na.rm = TRUE) 
    } 

    return(result) 
} 

# Use group by but _do not_ pipe the result through summarize() 

want <- have %>% 
    group_by(CAT) %>% 
    mutate(RE = leave_one_out_mean(R), 
      IE = leave_one_out_mean(I))

Result

want 

Source: local data frame [6 x 6] 
Groups: CAT [2] 

    ID CAT  R  I RE IE 
    <int> <int> <int> <int> <dbl> <dbl> 
1  1  1  1  3 1.5  1 
2  2  1  2 NA 1.0  2 
3  3  1  1  1 1.5  3 
4  4  2 NA  3 4.0  5 
5  5  2  4  5 4.0  3 
6  6  2  4 NA 4.0  4

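The for loop could be replaced with an apply-family function, but I wrote it this way to make the logic clear rather than to optimize execution.

Source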

2017-07-27 22:57:11 Damian

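Thanks, this works well. Could you also show the optimized version that uses an apply function? – Justin

What I had in mind was replacing the body of the function with `sapply(seq_along(x), function(i) mean(x[-i], na.rm = TRUE))`, but I have not confirmed that it is actually faster. – Damian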
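The replacement suggested in that comment could be sketched as follows. It uses vapply() instead of sapply() (a substitution made here so that the result is always a numeric vector), and the name `leave_one_out_mean_v` is hypothetical. No claim is made about speed; the work per element is the same as in the loop.

```r
# Loop-free variant of leave_one_out_mean(): same logic, one call per index.
leave_one_out_mean_v <- function(x) {
  vapply(seq_along(x),
         function(i) mean(x[-i], na.rm = TRUE),  # drop element i, ignore NAs
         numeric(1))                             # each call returns one number
}

# Check against the CAT == 1 values from the example
leave_one_out_mean_v(c(1, 2, 1))   # 1.5 1.0 1.5
leave_one_out_mean_v(c(3, NA, 1))  # 1 2 3
```

It can be dropped into the same group_by(CAT) %>% mutate(...) pipeline in place of the loop-based function.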

A base R solution, without loops, inspired by your SQL code.

d <- read.table(text = 
'ID CAT R I 
1 1 1 3 
2 1 2 NA 
3 1 1 1 
4 2 NA 3 
5 2 4 5 
6 2 4 NA', header = TRUE) 

myfunc <- function(x) { 
    tmp <- x ; tmp[is.na(tmp)] <- 0 
    ((sum(x, na.rm = TRUE)-tmp)/(length(x[!is.na(x)])-1 + is.na(x))) 
} 
RE <- as.vector(t(aggregate(d["R"], d["CAT"], myfunc)$R)) 
IE <- as.vector(t(aggregate(d["I"], d["CAT"], myfunc)$I)) 

cbind(d, RE, IE)

Source

2017-07-27 23:16:41 Gilles

Also a useful solution, although I had to do some extra work around the cbind, because the data sets did not line up. – Justin
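The aggregate()/cbind() combination relies on the rows being sorted by CAT and on every group having the same size, which may be why the columns did not line up on the real data. A possible alternative, sketched here with base R's ave(), writes each group's results back into the original row positions. The function name `loo_mean` is hypothetical; it restates the same formula as `myfunc`.

```r
d <- read.table(header = TRUE, text =
"ID CAT R I
1 1 1 3
2 1 2 NA
3 1 1 1
4 2 NA 3
5 2 4 5
6 2 4 NA")

# Leave-one-out mean in closed form:
# (group sum - own value) / (number of observed values - 1),
# where a missing own value contributes 0 and is not subtracted from the count.
loo_mean <- function(x) {
  obs <- !is.na(x)
  (sum(x, na.rm = TRUE) - ifelse(obs, x, 0)) / (sum(obs) - obs)
}

# ave() applies the function within each CAT and keeps the original row order
d$RE <- ave(d$R, d$CAT, FUN = loo_mean)
d$IE <- ave(d$I, d$CAT, FUN = loo_mean)
```

Because the new columns are assigned directly to `d`, any additional variables in the data set are retained without a separate join.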
