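Conditional summarizing in dplyr based on date

I'm an R noob and am trying to summarize a data set so that, for each ID, it totals the number of events of each type that occurred between events of type 'B' for that ID. Here is an example to illustrate: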

id <- c('1', '1', '1', '2', '2', '2', '3', '3', '3', '3') 
type <- c('A', 'A', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'B') 
datestamp <- as.Date(c('2016-06-20','2016-07-16','2016-08-14','2016-07-17' 
         ,'2016-07-18','2016-07-19','2016-07-16','2016-07-19' 
         , '2016-07-21','2016-08-20')) 
df <- data.frame(id, type, datestamp)

which produces:

> df
   id type  datestamp
1   1    A 2016-06-20
2   1    A 2016-07-16
3   1    B 2016-08-14
4   2    A 2016-07-17
5   2    B 2016-07-18
6   2    C 2016-07-19
7   3    A 2016-07-16
8   3    B 2016-07-19
9   3    C 2016-07-21
10  3    B 2016-08-20

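Any time an event 'B' occurs, I want to know the number of events of each type that occurred before that B event but after any other B event for that ID. What I'd like to end up with is a table like this: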

  id type B_instance count
1  1    A          1     2
2  2    A          1     1
3  3    A          1     1
4  3    C          2     1

While researching, this question came closest: summarizing a field based on the value of another field in dplyr

I've been trying to make this work:

df2 <- df %>% 
    group_by(id, type) %>% 
    summarize(count = count(id[which(datestamp < datestamp[type =='B'])])) %>% 
    filter(type != 'B')

But it errors (and even if it worked, it wouldn't account for 2 'B' events within the same ID, as with id = 3).

Source

2016-08-23 feyr

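You can use cumsum to create a new grouping variable B_instance by computing cumsum(type == "B"), then filter out the rows that fall after the last B, along with the type B rows themselves, since they are not counted. Then use count to count the occurrences grouped by id, B_instance and type.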

library(dplyr)

df %>% 
     group_by(id) %>% 
     # create B_instance using cumsum on the type == "B" condition 
     mutate(B_instance = cumsum(type == "B") + 1) %>%  
     # filter out rows with type behind the last B and all B types     
     filter(B_instance < max(B_instance), type != "B") %>% 
     # count the occurrences of type grouped by id and B_instance 
     count(id, type, B_instance) 

# Source: local data frame [4 x 4] 
# Groups: id, type [?] 

#       id   type B_instance     n
#   <fctr> <fctr>      <dbl> <int>
# 1      1      A          1     2
# 2      2      A          1     1
# 3      3      A          1     1
# 4      3      C          2     1
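One thing the pipeline above relies on is that the rows are already in date order within each id, because cumsum works on row order and never looks at datestamp. A minimal sketch of the same steps on a deliberately shuffled copy of the sample data, with an arrange() added up front; the names shuffled and result, and the shuffle order, are made up for illustration and are not part of the original answer:

```r
library(dplyr)

# Same sample data as in the question
df <- data.frame(
  id = c('1', '1', '1', '2', '2', '2', '3', '3', '3', '3'),
  type = c('A', 'A', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'B'),
  datestamp = as.Date(c('2016-06-20', '2016-07-16', '2016-08-14', '2016-07-17',
                        '2016-07-18', '2016-07-19', '2016-07-16', '2016-07-19',
                        '2016-07-21', '2016-08-20'))
)

# Hypothetical shuffled copy, to mimic data that is not stored in date order
shuffled <- df[c(10, 3, 7, 1, 9, 5, 2, 8, 4, 6), ]

result <- shuffled %>%
  # assumption: restore chronological order within each id before cumsum
  arrange(id, datestamp) %>%
  group_by(id) %>%
  mutate(B_instance = cumsum(type == "B") + 1) %>%
  filter(B_instance < max(B_instance), type != "B") %>%
  ungroup() %>%
  count(id, type, B_instance)

result
```

With the sample data this should give the same four rows as above; without the arrange() step the shuffled input would be grouped by whatever order the rows happen to be in.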

Source

2016-08-23 19:14:15 Psidom

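This works perfectly! Thanks! Out of curiosity, why does the cumsum also need the + 1? – feyr

The increment is there to match the instance number; otherwise it would start from zero, and the result would look like '0,0,0,1' instead of '1,1,1,2'. – Psidom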
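The effect of the + 1 can be seen on a single id in plain base R. A small sketch using the four event types of id 3 from the question, in date order; the variable names are illustrative only:

```r
# Event types for id 3, in date order
type <- c('A', 'B', 'C', 'B')

# Running count of B events seen so far, including the current row
zero_based <- cumsum(type == "B")      # 0 1 1 2

# Shifted so that events before the first B belong to B instance 1
one_based <- cumsum(type == "B") + 1   # 1 2 2 3
```

Here the leading A gets instance 1 and the C between the two B events gets instance 2. The rows carrying the highest value (3 here) are at or after the last B, which is why the answer filters on B_instance < max(B_instance).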

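Here is an option using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)); grouped by 'id', we take the sequence up to the max position where 'type' is 'B' and get the corresponding row index (.I). Then we subset the dataset (df[i1]), remove the rows where 'type' is 'B', and, grouped by 'id', 'type' and the rleid of 'type', get the number of rows as 'count'.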

library(data.table) 
i1 <- setDT(df)[, .I[seq(max(which(type=="B")))] , by = id]$V1 
df[i1][type!="B"][, .(count = .N), .(id, type, B_instance = rleid(type))] 
#    id type B_instance count
# 1:  1    A          1     2
# 2:  2    A          1     1
# 3:  3    A          1     1
# 4:  3    C          2     1
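To make the first step easier to follow, here is a sketch that only builds the row index i1 and inspects it. It rebuilds the sample data as a data.table named dt (a name chosen here for illustration) rather than converting df in place:

```r
library(data.table)

# Same sample data as in the question, built directly as a data.table
dt <- data.table(
  id = c('1', '1', '1', '2', '2', '2', '3', '3', '3', '3'),
  type = c('A', 'A', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'B'),
  datestamp = as.Date(c('2016-06-20', '2016-07-16', '2016-08-14', '2016-07-17',
                        '2016-07-18', '2016-07-19', '2016-07-16', '2016-07-19',
                        '2016-07-21', '2016-08-20'))
)

# Per id: positions 1..(position of the last "B"), mapped to global row numbers via .I
i1 <- dt[, .I[seq(max(which(type == "B")))], by = id]$V1

i1
# 1 2 3 4 5 7 8 9 10
```

Row 6 (id 2, type C, after that id's last B) is the only row dropped at this stage. Note that this expression assumes every id has at least one B event; an id without any would need extra handling, since which() would return an empty vector.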

Source

2016-08-23 19:22:10 akrun

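This works well too, thanks. @Psidom's dplyr solution feels more intuitive to me. But are there benefits to using data.table that I'm not aware of? Or is it just personal preference? – feyr

@feyr They are both good packages. If you want to make use of assignment by reference (':=') (not done here), data.table can do that and is very efficient. In this case, though, Psidom's solution will be as good as mine, or even more elegant. – akrun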
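For readers curious what the ':=' mentioned in the comment looks like, below is a rough sketch that combines it with the cumsum idea from the dplyr answer. This is not akrun's code; the names dt, last_instance and res are made up for illustration, and it assumes rows are already in date order within each id:

```r
library(data.table)

# Same sample data as in the question, built directly as a data.table
dt <- data.table(
  id = c('1', '1', '1', '2', '2', '2', '3', '3', '3', '3'),
  type = c('A', 'A', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'B'),
  datestamp = as.Date(c('2016-06-20', '2016-07-16', '2016-08-14', '2016-07-17',
                        '2016-07-18', '2016-07-19', '2016-07-16', '2016-07-19',
                        '2016-07-21', '2016-08-20'))
)

# := adds columns to dt in place (by reference) instead of returning a modified copy
dt[, B_instance := cumsum(type == "B") + 1, by = id]
dt[, last_instance := max(B_instance), by = id]

# Keep rows before the last B, drop the B rows, then count per group
res <- dt[B_instance < last_instance & type != "B",
          .(count = .N),
          by = .(id, type, B_instance)]

res
```

Because ':=' modifies dt itself, the two helper columns remain on dt afterwards, which is convenient for further steps but worth keeping in mind if the original table should stay untouched.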
