2016-05-16 17 views
3

Group rows up to the current row in R data.table

I have a dataset that looks like this:

library(data.table) 

set.seed(10) 

n_rows <- 50 

data <- data.table(id = 1:n_rows,
                   timestamp = Sys.Date() + as.difftime(1:n_rows, units = "days"),
                   subject = sample(letters[1:4], n_rows, replace = TRUE),
                   response = sample(3, n_rows, replace = TRUE))

head(data, 10) 

    id  timestamp subject response
 1:  1 2016-05-17       c        2
 2:  2 2016-05-18       b        3
 3:  3 2016-05-19       b        1
 4:  4 2016-05-20       c        2
 5:  5 2016-05-21       a        1
 6:  6 2016-05-22       a        2
 7:  7 2016-05-23       b        2
 8:  8 2016-05-24       b        2
 9:  9 2016-05-25       c        2
10: 10 2016-05-26       b        2

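I need to do some group-by operations that compute, for each subject, the number of times each response has occurred so far.

The group-by below produces the nth_test column.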

new_vars <- data[, .(id, timestamp, nth_test = 1:.N, response), by=.(subject)] 

    subject id  timestamp nth_test response
 1:       c  1 2016-05-17        1        2
 2:       c  4 2016-05-20        2        2
 3:       c  9 2016-05-25        3        2
 4:       c 11 2016-05-27        4        1
 5:       c 12 2016-05-28        5        1
 6:       c 14 2016-05-30        6        2
 7:       c 22 2016-06-07        7        2
 8:       c 26 2016-06-11        8        2
 9:       c 31 2016-06-16        9        3
10:       c 36 2016-06-21       10        1
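For reference, here is a minimal sketch of the same per-group counter on a toy table. The table `toy` and its values are made up for illustration and are not the data from the question. It adds the column by reference with `:=` instead of building a new table, and uses `seq_len(.N)`, which gives the same result as `1:.N` here because every group has at least one row.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
toy <- data.table(id = 1:6,
                  subject = c("a", "b", "a", "a", "b", "c"))

# Running index within each subject, following the current row order
toy[, nth_test := seq_len(.N), by = subject]

toy[]
```

The counter follows whatever order the rows are in, so the table should be sorted by timestamp first if that is the intended order.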

But I can't figure out how to produce the columns resp_1, resp_2 and resp_3 as shown below.

    subject id  timestamp nth_test response resp_1 resp_2 resp_3
 1:       c  1 2016-05-17        1        2      0      1      0
 2:       c  4 2016-05-20        2        2      0      2      0
 3:       c  9 2016-05-25        3        2      0      3      0
 4:       c 11 2016-05-27        4        1      1      3      0
 5:       c 12 2016-05-28        5        1      2      3      0
 6:       c 14 2016-05-30        6        2      2      4      0
 7:       c 22 2016-06-07        7        2      2      5      0
 8:       c 26 2016-06-11        8        2      2      6      0
 9:       c 31 2016-06-16        9        3      2      6      1
10:       c 36 2016-06-21       10        1      3      6      1

Cheers

+2

How is your data sorted, given that these column values depend on the order of your data? You can do something like `resp_i := cumsum(response == i)` – Psidom

+0

Psidom that's exactly what I need, thanks. – efbbrown
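The suggestion in the comment above could be spelled out roughly as follows. This is only a sketch on a made-up table (`toy` and its values are assumptions, not the data from the question), with one column written out by hand per response value.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
toy <- data.table(subject  = c("a", "a", "b", "a", "b"),
                  response = c(2L, 1L, 2L, 2L, 2L))

# One running count per response value, restarted for each subject.
# response == 1L is TRUE/FALSE per row; cumsum() counts the TRUEs so far.
toy[, `:=`(resp_1 = cumsum(response == 1L),
           resp_2 = cumsum(response == 2L)),
    by = subject]

toy[]
```

As the comment points out, the counts depend on row order, so the rows of each subject need to be in timestamp order before this is run.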

Answers

3

We can try

Un1 <- unique(sort(data$response)) 
data[, c("nth_test", paste("resp", Un1, sep="_")) := c(list(1:.N), 
     lapply(Un1, function(x) cumsum(x==response))) , .(subject)] 
data[order(subject, timestamp)][subject=="c"] 
#     id  timestamp subject response nth_test resp_1 resp_2 resp_3
#  1:  1 2016-05-17       c        2        1      0      1      0
#  2:  4 2016-05-20       c        2        2      0      2      0
#  3:  9 2016-05-25       c        2        3      0      3      0
#  4: 11 2016-05-27       c        1        4      1      3      0
#  5: 12 2016-05-28       c        1        5      2      3      0
#  6: 14 2016-05-30       c        2        6      2      4      0
#  7: 22 2016-06-07       c        2        7      2      5      0
#  8: 26 2016-06-11       c        2        8      2      6      0
#  9: 31 2016-06-16       c        3        9      2      6      1
# 10: 36 2016-06-21       c        1       10      3      6      1
# 11: 39 2016-06-24       c        1       11      4      6      1
# 12: 40 2016-06-25       c        1       12      5      6      1
# 13: 44 2016-06-29       c        2       13      5      7      1
+1

Thanks, a pretty elegant solution. – efbbrown

+0

Nice answer, but why order on `subject` if you subset on it afterwards? Surely it is better to subset first and then order by `timestamp`. – jangorecki

+1

@jangorecki You are right. I was just showing the expected output as in the OP's post. – akrun
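To illustrate the point about ordering, here is a small sketch on made-up data (`toy`, `ordered_then_subset` and `subset_then_ordered` are hypothetical names). Both forms return the same rows in the same order, but the second only sorts the rows that are kept.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
toy <- data.table(id = 1:6,
                  timestamp = as.Date("2016-05-17") + c(3, 0, 1, 5, 2, 4),
                  subject = c("c", "a", "c", "a", "c", "a"))

# Order the whole table, then keep one subject
ordered_then_subset <- toy[order(subject, timestamp)][subject == "c"]

# Keep one subject, then order only those rows by timestamp
subset_then_ordered <- toy[subject == "c"][order(timestamp)]

identical(ordered_then_subset$id, subset_then_ordered$id)
```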

0

I wanted to see what this would look like if done in data.table in long format with cummax/cumsum (perhaps more efficient in some configurations):

> data[order(subject, timestamp) 
+  ][, rCnt := 1:.N, .(subject, response) 
+  ][, responseStr := sprintf('%s_%s', 'resp', response) 
+  ][, dcast(.SD, id + timestamp + subject + response ~ responseStr, value.var='rCnt', fill=0) 
+  ][, melt(.SD, id.vars=c('id', 'timestamp', 'subject', 'response')) 
+  ][order(subject, timestamp) 
+  ][, value := cummax(value), .(subject, variable) 
+  ][, nth_test := 1:.N, .(subject, variable) 
+  ][, dcast(.SD, id + timestamp + subject + response + nth_test ~ variable, value.var='value') 
+  ][order(subject, timestamp) 
+  ][subject == 'c' 
+  ] 
    id  timestamp subject response nth_test resp_1 resp_2 resp_3
 1:  1 2016-05-17       c        2        1      0      1      0
 2:  4 2016-05-20       c        2        2      0      2      0
 3:  9 2016-05-25       c        2        3      0      3      0
 4: 11 2016-05-27       c        1        4      1      3      0
 5: 12 2016-05-28       c        1        5      2      3      0
 6: 14 2016-05-30       c        2        6      2      4      0
 7: 22 2016-06-07       c        2        7      2      5      0
 8: 26 2016-06-11       c        2        8      2      6      0
 9: 31 2016-06-16       c        3        9      2      6      1
10: 36 2016-06-21       c        1       10      3      6      1
11: 39 2016-06-24       c        1       11      4      6      1
12: 40 2016-06-25       c        1       12      5      6      1
13: 44 2016-06-29       c        2       13      5      7      1
>
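In case it helps to see why `cummax` gives the running counts in the long-format version, here is a base R sketch on a single made-up response sequence (the vectors `response`, `rCnt` and `filled` are assumptions for illustration). The per-response counter only increases, and the cells filled with 0 never exceed it, so the running maximum carries the last count forward.

```r
# Hypothetical response sequence for a single subject
response <- c(2, 2, 1, 2, 1)

# Per-response counter: 1, 2, ... within each distinct response value
rCnt <- ave(seq_along(response), response, FUN = seq_along)

# Counter for response 2, with 0 wherever the response was something else
# (this mirrors one column after dcast(..., fill = 0))
filled <- ifelse(response == 2, rCnt, 0)

# Running maximum of the zero-filled counter
via_cummax <- cummax(filled)

# Running count computed directly
via_cumsum <- cumsum(response == 2)

all(via_cummax == via_cumsum)
```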