2017-07-07

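Subset data frame to include only levels of one factor that have values in both levels of another factor

I am working with a data frame of numeric measurements. Some individuals have been measured more than once, both as juveniles and as adults. A reproducible example: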

ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3") 
age <- rep(c("juvenile", "adult"), each=5) 
size <- rnorm(10) 

# e.g. a1 is measured 3 times, twice as a juvenile, once as an adult. 
d <- data.frame(ID, age, size) 

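My goal is to subset the data frame, keeping only the IDs that appear at least once as a juvenile and at least once as an adult. I am not sure how to do this.

The resulting data frame would contain all measurements for individuals a1, a2 and a3, but exclude a4, a5 and a6, because they were not measured at both stages.

A similar question was asked 7 months ago but never received an answer (Subset data frame to include only levels one factor that have values in both levels of another factor).

Thanks!

Answers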

Here is an option with data.table:

library(data.table)
setDT(d)[, .SD[all(c("juvenile", "adult") %in% age)], ID]

Or a base R option with ave:

d[with(d, ave(as.character(age), ID, FUN = function(x) length(unique(x)))>1),] 
# ID  age  size 
#1 a1 juvenile -1.4545407 
#2 a2 juvenile -0.4695317 
#3 a3 juvenile 0.2271316 
#5 a1 juvenile 0.2961210 
#6 a2 adult -0.8331993 
#9 a1 adult -0.6924967 
#10 a3 adult -0.4619550 

With dplyr, you can use group_by %>% filter:

library(dplyr) 
d %>% group_by(ID) %>% filter(all(c("juvenile", "adult") %in% age)) 

# A tibble: 7 x 3 
# Groups: ID [3] 
#  ID  age  size 
# <fctr> <fctr>  <dbl> 
#1  a1 juvenile -0.6947697 
#2  a2 juvenile -0.3665272 
#3  a3 juvenile 1.0293555 
#4  a1 juvenile 0.2745224 
#5  a2 adult 0.5299029 
#6  a1 adult 2.2247802 
#7  a3 adult -0.4717160 

Use split on the IDs by age, then intersect, and subset:

d[d$ID %in% Reduce(intersect, split(d$ID, d$age)),] 
# ID  age  size 
#1 a1 juvenile 1.44761836 
#2 a2 juvenile 1.70098645 
#3 a3 juvenile 0.08231986 
#5 a1 juvenile 0.91240568 
#6 a2 adult -1.77318962 
#9 a1 adult 0.13597986 
#10 a3 adult -1.18575294 
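
To make the one-liner above easier to follow, here is a minimal sketch that runs the same split / intersect / subset steps one at a time. The IDs and ages are the ones from the question. The size column holds placeholder values instead of rnorm output, and the names ids_by_age, both and res are introduced here only for illustration.

```r
# Same IDs and ages as in the question; size is a placeholder (assumption)
ID <- c("a1", "a2", "a3", "a4", "a1", "a2", "a5", "a6", "a1", "a3")
age <- rep(c("juvenile", "adult"), each = 5)
d <- data.frame(ID, age, size = seq_along(ID))

# Step 1: one vector of IDs per age class
ids_by_age <- split(d$ID, d$age)

# Step 2: IDs that occur in every age class
both <- Reduce(intersect, ids_by_age)

# Step 3: keep every row that belongs to one of those IDs
res <- d[d$ID %in% both, ]
print(res)
```

Because Reduce applies intersect across all elements of the list, the same steps should also work if age had more than two levels, in which case only IDs measured at every level would be kept.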