2016-12-16

R: rolling/sliding window and distinct count over a sliding number of days

I have the dataset below:

set.seed(1) 
transaction_date <- sample(seq(as.Date('2016/01/01'), as.Date('2016/02/01'), by="day"), 24) 
set.seed(1) 
df <- data.frame("categ" = paste0("Categ",rep(1:2,12)), "prod" = sample(paste0("Prod",rep(seq(1:3),8))), customer_id = paste0("customer ",seq(1:24)),transaction_date=transaction_date) 
df_ordered <- df[order(df$categ,df$prod,df$transaction_date,df$customer_id),] 
df_ordered 

categ prod customer_id transaction_date 
1 Categ1 Prod1 customer 1  2016-01-09 
3 Categ1 Prod1 customer 3  2016-01-18 
19 Categ1 Prod1 customer 19  2016-01-28 
7 Categ1 Prod1 customer 7  2016-01-29 
5 Categ1 Prod2 customer 5  2016-01-06 
23 Categ1 Prod2 customer 23  2016-01-07 
13 Categ1 Prod2 customer 13  2016-01-14 
9 Categ1 Prod2 customer 9  2016-01-16 
15 Categ1 Prod2 customer 15  2016-01-20 
21 Categ1 Prod2 customer 21  2016-01-24 
11 Categ1 Prod3 customer 11  2016-01-05 
17 Categ1 Prod3 customer 17  2016-01-31 
10 Categ2 Prod1 customer 10  2016-01-02 
20 Categ2 Prod1 customer 20  2016-01-11 
24 Categ2 Prod1 customer 24  2016-01-23 
16 Categ2 Prod1 customer 16  2016-02-01 
12 Categ2 Prod2 customer 12  2016-01-04 
4 Categ2 Prod2 customer 4  2016-01-27 
22 Categ2 Prod3 customer 22  2016-01-03 
14 Categ2 Prod3 customer 14  2016-01-08 
2 Categ2 Prod3 customer 2  2016-01-12 
18 Categ2 Prod3 customer 18  2016-01-15 
8 Categ2 Prod3 customer 8  2016-01-17 
6 Categ2 Prod3 customer 6  2016-01-25 

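I need a distinct count of customers over a 12-day window, starting from the first (minimum) transaction_date observed for each categ and prod combination.

The window slides to cover the 12 days before the current transaction date, and the count includes all transactions that fall in that bucket. Below is the output I am trying to create. I would like to avoid loops for this task.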

[image: desired output]


Possible duplicate of [Relative windowed running sum through data.table non-equi join](http://stackoverflow.com/questions/41007099/relative-windowed-running-sum-through-data-table-non-equi-join) – ExperimenteR

Answers


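This can be achieved with rollapply from zoo together with dplyr. First, we fill in all missing dates for every group, using expand.grid and merge, so that each group has a continuous daily series. Then we group by category and product, arrange by date, and apply a rolling window to the values in customer_id. The function applied at each step returns the length of the vector of unique values, with NAs removed. Finally, we filter out the added dates again, that is, the rows where customer_id is not available.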
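To make the padding step and the counting function concrete, here is a minimal base-R sketch on three rows taken from the question's printed output. The names `toy`, `all_days`, `padded` and `count_customers` are made up for illustration and are not part of the answer's code; the explicit `order()` call is only there so that row positions match calendar days.

```r
# Toy data: one categ/prod group, three transactions with gaps between the dates.
toy <- data.frame(
  categ = "Categ1",
  prod = "Prod1",
  customer_id = c("customer 1", "customer 3", "customer 19"),
  transaction_date = as.Date(c("2016-01-09", "2016-01-18", "2016-01-28")),
  stringsAsFactors = FALSE
)

# Every calendar day between the first and last transaction, for each group.
all_days <- expand.grid(
  categ = unique(toy$categ),
  prod = unique(toy$prod),
  transaction_date = seq(min(toy$transaction_date), max(toy$transaction_date), by = "day"),
  stringsAsFactors = FALSE
)

# The outer join keeps the real transactions and adds NA customers on empty days.
padded <- merge(toy, all_days, by = c("categ", "prod", "transaction_date"), all = TRUE)
padded <- padded[order(padded$transaction_date), ]

# The per-window function: number of distinct customers, ignoring the NA padding.
count_customers <- function(x) length(unique(x[!is.na(x)]))

nrow(padded)                               # one row per day, 2016-01-09 to 2016-01-28
count_customers(padded$customer_id[1:12])  # the 12 days starting on 2016-01-09
```

Because the padded series has exactly one row per day, a window of 12 rows corresponds to 12 calendar days, which is what lets a row-based rolling function stand in for a date-based one.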

library(dplyr) 
library(zoo) 

set.seed(1) 
transaction_date <- sample(seq(as.Date('2016/01/01'), as.Date('2016/02/01'), by="day"), 24) 
set.seed(1) 
df <- data.frame("categ" = paste0("Categ",rep(1:2,12)), "prod" = sample(paste0("Prod",rep(seq(1:3),8))), customer_id = paste0("customer ",seq(1:24)),transaction_date=transaction_date) 

all_combinations <- expand.grid(categ=unique(df$categ), 
     prod=unique(df$prod), 
     transaction_date=seq(min(df$transaction_date), max(df$transaction_date), by="day")) 

df <- merge(df, all_combinations, by=c('categ','prod','transaction_date'), all=TRUE) 

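# Count distinct customers in each 12-row (12-day) window, per categ/prod group 
res <- df %>% 
     group_by(categ, prod) %>% 
     arrange(transaction_date) %>% 
     # partial=TRUE lets the windows at the end of each series be shorter than 12 rows 
     mutate(ucust=rollapply(customer_id, width=12, FUN=function(x) length(unique(x[!is.na(x)])), partial=TRUE, align='left')) %>% 
     # drop the padding rows that were added only to make the dates continuous 
     filter(!is.na(customer_id)) 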
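
As a cross-check on small data, the window can also be computed directly from the dates, without padding. The sketch below is not the answer's method; it assumes a trailing window (the current transaction date plus the 11 days before it), which is one reading of "12 days before the current transaction date" in the question. Note that `align='left'` in the rollapply call places each date at the start of its window, so it looks forward; `align='right'` would correspond to the trailing reading used here. The names `toy`, `window_days` and `days_back` are hypothetical.

```r
# Four rows copied from the question's printed output (two categ/prod groups).
toy <- data.frame(
  categ = "Categ1",
  prod = c("Prod1", "Prod1", "Prod1", "Prod2"),
  customer_id = c("customer 1", "customer 3", "customer 19", "customer 5"),
  transaction_date = as.Date(c("2016-01-09", "2016-01-18", "2016-01-28", "2016-01-06")),
  stringsAsFactors = FALSE
)

window_days <- 12  # assumed: the current date plus the 11 days before it

# For each row, count distinct customers in the same categ/prod group whose
# transaction falls inside the trailing window ending on that row's date.
toy$ucust <- vapply(seq_len(nrow(toy)), function(i) {
  same_group <- toy$categ == toy$categ[i] & toy$prod == toy$prod[i]
  days_back <- as.numeric(toy$transaction_date[i] - toy$transaction_date)
  in_window <- same_group & days_back >= 0 & days_back < window_days
  length(unique(toy$customer_id[in_window]))
}, integer(1))

toy
```

vapply is still an implicit loop and compares every row with every other row, so this is only meant to pin down the window semantics on a few rows, not to replace a vectorised or join-based solution on large data.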

Sorry, I was too quick; existing dates were duplicated. I have corrected it now. – mpjdem