Matching values based on group ID

Suppose I have the following data frame (the real one is a very large dataset):

df<- structure(list(x = c(1, 1, 1, 2, 2, 3, 3, 3), y = structure(c(1L, 
6L, NA, 2L, 4L, 3L, 7L, 5L), .Label = c("all", "fall", "hello", 
"hi", "me", "non", "you"), class = "factor"), z = structure(c(5L, 
NA, 4L, 2L, 1L, 6L, 3L, 4L), .Label = c("fall", "hi", "me", "mom", 
"non", "you"), class = "factor")), .Names = c("x", "y", "z"), row.names = c(NA, 
-8L), class = "data.frame")

It looks like this:

> df
  x     y    z
1 1   all  non
2 1   non <NA>
3 1  <NA>  mom
4 2  fall   hi
5 2    hi fall
6 3 hello  you
7 3   you   me
8 3    me  mom

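What I am trying to do is count, within each group of x (1, 2 or 3), how many values of y also appear in z. For example, group 1 has one matching value, "non" (NA should be ignored). The desired output has one row per group with that count: 1 for group 1, 2 for group 2 and 2 for group 3.

I would like to do this without a for loop, because my dataset is large, but I have not been able to find a way.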

Source

2015-07-03 athraa

Using dplyr:

library(dplyr) 

df %>% group_by(x) %>% 
     summarise(n = sum(y %in% na.omit(z)))

Source

2015-07-03 01:03:21 jeremycg

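I really don't know why it doesn't give me the desired output. It gives me 'n 1 5' – athraa

@AhmedSalhin Works for me. Maybe 'plyr' is interfering. I think these packages have some incompatibilities, depending on the order in which they are loaded. – Frank

@Frank Yes, you are right. I detached 'plyr' and it works for me. Do you know how to get around this interference problem? – athraa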
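One common way around this kind of masking is to call the dplyr functions with an explicit namespace prefix, so it no longer matters which package was attached last. The sketch below assumes the 'n 1 5' result came from `summarise` resolving to plyr's version, which would ignore the grouping and count the 5 matches across the whole data frame; that diagnosis is an assumption based on the comments above. The data frame is a compact retyping of the question's `df` with character columns instead of factors.

```r
# Assumes the dplyr package is installed.
library(dplyr)

# Compact stand-in for the question's data frame (characters instead of factors).
df <- data.frame(
  x = c(1, 1, 1, 2, 2, 3, 3, 3),
  y = c("all", "non", NA, "fall", "hi", "hello", "you", "me"),
  z = c("non", NA, "mom", "hi", "fall", "you", "me", "mom")
)

# Namespace-qualified calls always use dplyr's verbs,
# even if plyr was attached afterwards and masks summarise().
res <- df %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(n = sum(y %in% na.omit(z)))
```

Attaching plyr before dplyr should also help, since the package attached last is the one whose `summarise` is found first.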

Here is a solution using by() and match():

do.call(rbind,by(df,df$x,function(g) c(x=g$x[1],n=sum(!is.na(match(g$y,g$z,incomparables=NA))))))
##   x n
## 1 1 1
## 2 2 2
## 3 3 2
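As a small illustration of why the `incomparables` argument matters here: by default `match()` treats an NA in its first argument as matching an NA in the table, which would inflate the count. The vectors below are group 1's values, retyped as plain character vectors for the sketch.

```r
# Group 1 from the question, as character vectors.
y <- c("all", "non", NA)
z <- c("non", NA, "mom")

# Default behaviour: the NA in y matches the NA in z, so two matches are counted.
naive <- sum(!is.na(match(y, z)))

# With incomparables = NA, NA is never matched, leaving only "non".
strict <- sum(!is.na(match(y, z, incomparables = NA)))
```

Here `naive` is 2 and `strict` is 1, the count the question asks for.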

Source

2015-07-03 01:33:31 bgoldst

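I like this base R solution... to be honest, mine is long and clumsy, and I prefer this one. Upvoted! – SabDeM

Just for some late-night fun I tried a base R solution, which is of course ugly as hell.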

# compare each group's y with the same group's z (x[["z"]], not the full df[["z"]])
ind <- by(df, df$x, function(x) which(na.omit(x[["y"]]) %in% na.omit(x[["z"]])))
sm <- lapply(ind, length)
cbind(unique(df$x), sm)
    sm
1 1 1
2 2 2
3 3 2

Another base R approach, with less code (and less ugliness, I hope):

ind <- by(df, df$x, function(x) sum(na.omit(x[["y"]]) %in% na.omit(x[["z"]]))) 
cbind(unique(df$x), ind) 
    ind
1 1   1
2 2   2
3 3   2

Source

2015-07-03 01:36:05 SabDeM
