2016-08-23

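Creating web sessions

I know Randy has a great article on Sessionizing Log Data, but I am struggling to adapt its idea of generating session IDs based on a 30-minute inactivity window.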

This is what I would like to produce in R, preferably with dplyr. I am trying to compute the session_id variable shown below.

   dim_user_id       activity_date session_id
1      2665871 2014-12-31 19:00:08          1
2      2665871 2014-12-31 19:00:45          1
3      2665871 2014-12-31 19:01:01          1
4      2665877 2014-12-31 19:00:08          2
5      2665877 2014-12-31 19:00:33          2
6      2666612 2014-12-31 19:08:19          3
7      2666612 2014-12-31 19:08:32          3
8      2666612 2014-12-31 19:09:04          3
9      2666626 2014-12-31 19:00:25          4
10     2666627 2014-12-31 19:04:39          5

The code I tried is:

user_activity$sid = 1:nrow(user_activity) 
user_activity$session_id = NA 
# startTime = Sys.time() 
user_activity = user_activity %>% 
    group_by(dim_user_id) %>% 
    arrange(activity_date) %>% 
    transform(lag_seconds = ifelse(lag(dim_user_id) == dim_user_id, 
           as.numeric(activity_date - lag(activity_date)), 
           9999)) %>% 
    mutate(session_id = ifelse(is.na(lag_seconds) | lag_seconds >= 1801, sid, lag(session_id))) 

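The problem I am running into is that I don't believe the values are being set row by row. I did explore the rowwise function in dplyr, but I got stuck.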

Thanks in advance.

Answer


If I understand you correctly, you are looking for group_indices, which you can use as follows:

df %>% mutate(session_id = group_indices_(df, .dots="dim_user_id")) 
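
To illustrate the idea of a group index without dplyr, here is a minimal base R sketch. It assumes a small hypothetical vector of user IDs (not the question's data frame) and numbers each distinct ID by its position among the sorted unique values, which is, as far as I know, how the grouped index above is ordered as well.

```r
# Hypothetical user IDs, one per activity row (illustration only).
user_ids <- c(2665871, 2665871, 2665877, 2666612, 2666612)

# Each row gets the position of its ID among the sorted distinct IDs,
# so rows sharing an ID share the same integer index.
group_id <- match(user_ids, sort(unique(user_ids)))

print(group_id)
```

Rows that belong to the same user receive the same integer, which is all that a per-user session_id needs as long as each user has a single session.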

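Edit: Since your sample data does not include a case where one user has multiple sessions separated by a time difference of more than 30 minutes, I used this altered dataset: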

df <- read.table(header=TRUE, text="dim_user_id date time 
2665871 2014-12-31 19:00:08 
2665871 2014-12-31 19:00:45 
2665871 2014-12-31 19:01:01 
2665877 2014-12-31 19:00:08 
2665877 2014-12-31 19:00:33 
2666612 2014-12-31 19:08:19 
2666612 2014-12-31 19:38:32 
2666612 2014-12-31 19:39:04 
2666626 2014-12-31 19:00:25 
2666627 2014-12-31 19:04:39") 

df$activity_date <- as.POSIXct(paste(df$date, df$time)) 
df$date <- NULL 
df$time <- NULL 

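So user #2666612 now has a lag of more than 30 minutes. The following code computes your session_id step by step. I am sure it can be shortened, but it is written this way for clarity.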

library(dplyr)
cutoff <- 30 * 60 # 30 minutes expressed in seconds
df %>%
    # group by user
    group_by(dim_user_id) %>%
    # seconds since the same user's previous activity (0 for the first row);
    # explicit units keep diff() from switching to minutes or hours
    mutate(time_diff = c(0, as.numeric(diff(activity_date), units = "secs"))) %>%
    # a gap longer than the cutoff starts a new session
    mutate(session_num = cumsum(time_diff > cutoff)) %>%
    # ungroup so the group indices are computed data-wide instead of per group
    ungroup() %>%
    # number each (dim_user_id, session_num) combination
    mutate(session_id = group_indices_(., .dots = c("dim_user_id", "session_num")))

This results in:

Source: local data frame [10 x 5] 

   dim_user_id       activity_date time_diff session_num session_id
         (int)              (time)     (dbl)       (int)      (int)
1      2665871 2014-12-31 19:00:08         0           0          1
2      2665871 2014-12-31 19:00:45        37           0          1
3      2665871 2014-12-31 19:01:01        16           0          1
4      2665877 2014-12-31 19:00:08         0           0          2
5      2665877 2014-12-31 19:00:33        25           0          2
6      2666612 2014-12-31 19:08:19         0           0          3
7      2666612 2014-12-31 19:38:32      1813           1          4
8      2666612 2014-12-31 19:39:04        32           1          4
9      2666626 2014-12-31 19:00:25         0           0          5
10     2666627 2014-12-31 19:04:39         0           0          6
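
The cumsum step in the pipeline above may be easier to follow in isolation. The sketch below is base R only and uses a hypothetical vector of gaps for a single user (the values are made up for illustration, not taken from the data). The comparison marks each row that starts a new session, and the running total of those marks becomes the session counter.

```r
# Hypothetical gaps, in seconds, between consecutive activities of ONE user.
# The first row has no predecessor, so its gap is set to 0.
time_diff <- c(0, 37, 2000, 10, 1900)
cutoff <- 30 * 60  # 30 minutes in seconds

# TRUE wherever the gap exceeds the cutoff, i.e. where a new session starts.
new_session <- time_diff > cutoff

# cumsum() treats TRUE as 1 and FALSE as 0, so the running total goes up
# by one at every session start and stays flat in between.
session_num <- cumsum(new_session)

print(data.frame(time_diff, new_session, session_num))
```

Because the counter only increases when a gap crosses the cutoff, every row between two long gaps shares the same session_num. In the pipeline this happens inside group_by, so the counter restarts at 0 for each user.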

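I did not know about group_indices, which is fantastic. The only other thing I need is to create a new session whenever a user's next activity occurs more than 30 minutes later. I stumbled upon this example, but I may need to adapt it. http://stackoverflow.com/questions/20413514/creating-a-session-visit-id-in-r?rq=1 – Btibert3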


@Btibert3 Please see my edit. Next time, please include such edge cases in your sample dataset. – Rentrop


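This is really great, thank you. One question: I see cumsum popping up in various solutions. Even after seeing it used, I am not sure what it is doing. – Btibert3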