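I have some data on the online activity of a group of users:
- `userId` indicates the ID of the user.
- `pageType` indicates the page the user is currently on. `home` means the home page and `content` means a content page.
- The rows are already sorted by time, so row 1 happened before row 2, row 2 before row 3, and so on.
- The actual data has about 2 million rows and 8 page types. `userId` is a 36-character `java.util.UUID` object.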
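[Goal]
I want to generate one new column for each `pageType` that counts the number of previous page views of exactly the same type (excluding the current one).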
[Sample data]
The following generates a sample of the actual data:
    library(data.table)
    DT <- data.table("userId"=rep(1:3, each=10),
                     "pageType"=c("home", "content", "home", "content", "home", "home", "content", "content", "home", "home",
                                  "content", "content", "home", "home", "content", "home", "home", "content", "home", "content",
                                  "home", "home", "content", "content", "home", "home", "content", "content", "home", "content"))
    > DT
        userId pageType
     1:      1     home
     2:      1  content
     3:      1     home
     4:      1  content
     5:      1     home
     6:      1     home
     7:      1  content
     8:      1  content
     9:      1     home
    10:      1     home
    ...
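To time candidate solutions at something closer to the real scale, a larger synthetic table may help. The sketch below is a stand-in built on assumptions, not the real data: `make_fake_ids` and `make_sample` are hypothetical helpers, the 36-character IDs are random hex strings laid out like a UUID rather than real `java.util.UUID` values, and the page-type names beyond `home` and `content` are invented.

```r
library(data.table)

# Hypothetical helper: build n random 36-character strings in the
# 8-4-4-4-12 layout of a UUID (not real, standards-compliant UUIDs).
make_fake_ids <- function(n) {
  hex <- function(k) {
    # n x k matrix of random hex digits, pasted together row by row
    m <- matrix(sample(c(0:9, letters[1:6]), n * k, replace = TRUE), nrow = n)
    do.call(paste0, as.data.frame(m, stringsAsFactors = FALSE))
  }
  paste(hex(8), hex(4), hex(4), hex(4), hex(12), sep = "-")
}

# Hypothetical helper: each user gets a block of consecutive rows, so the
# row order within a user plays the role of time order.
make_sample <- function(n_users = 1000L, rows_per_user = 20L, n_types = 8L) {
  types <- c("home", "content", paste0("type", seq_len(n_types - 2L)))
  data.table(
    userId   = rep(make_fake_ids(n_users), each = rows_per_user),
    pageType = sample(types, n_users * rows_per_user, replace = TRUE)
  )
}

set.seed(42)
big <- make_sample()  # 20,000 rows with the defaults
```

Raising `n_users` to 100,000 with the default `rows_per_user` gives roughly 2 million rows, which can then be wrapped in `system.time()` around each approach.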
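[What I tried]
I have tried two ways to solve this problem, but both are too slow. I also feel that my solutions do not use `data.table` the way it was designed to be used.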
Solution I
- Filter by `pageType` and increment by `userId`.
- Set the missing values for the other `pageType`.
Here is the code:
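    # Fill the NA entries of column `type` with the last count seen for the same user.
    FixPageView <- function(data, type) {
      val <- 0L
      for (i in seq_len(nrow(data))) {
        # Reset the carried-forward count when a new user's rows begin.
        if (i == 1L || data$userId[i] != data$userId[i - 1L]) val <- 0L
        if (is.na(data[[type]][i])) {
          set(data, i, type, val)
        } else {
          val <- data[[type]][i]
        }
      }
    }
    # Number each user's views of one page type: 0, 1, 2, ... (other rows stay NA).
    DT[pageType=="home", numHomePagesViewed:=0:(.N-1), by=userId]
    DT[pageType=="content", numContentPagesViewed:=0:(.N-1), by=userId]
    FixPageView(DT, "numHomePagesViewed")
    FixPageView(DT, "numContentPagesViewed")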
    > DT
        userId pageType numHomePagesViewed numContentPagesViewed
     1:      1     home                  0                     0
     2:      1  content                  0                     0
     3:      1     home                  1                     0
     4:      1  content                  1                     1
     5:      1     home                  2                     1
     6:      1     home                  3                     1
     7:      1  content                  3                     2
     8:      1  content                  3                     3
     9:      1     home                  4                     3
    10:      1     home                  5                     3
    ...
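The row-by-row fill in `FixPageView` can probably be replaced by a grouped last-observation-carried-forward fill. The following is a minimal sketch, assuming a data.table version that provides `nafill()`; `fill_counts` is a hypothetical helper name, and the small table is a shortened stand-in for the sample data.

```r
library(data.table)

# Shortened stand-in for the sample data: 2 users, 5 rows each.
DT <- data.table(
  userId   = rep(1:2, each = 5),
  pageType = c("home", "content", "home", "content", "home",
               "content", "content", "home", "home", "content")
)

# Hypothetical helper: number the rows of one page type within each user,
# carry the last seen count forward, and turn any remaining leading NAs into 0.
fill_counts <- function(data, type, col) {
  data[pageType == type, (col) := 0:(.N - 1L), by = userId]
  data[, (col) := nafill(nafill(get(col), type = "locf"), fill = 0L), by = userId]
  invisible(data)
}

fill_counts(DT, "home", "numHomePagesViewed")
fill_counts(DT, "content", "numContentPagesViewed")
print(DT)
```

Because both steps are grouped by `userId`, a user's leading rows are filled with 0 rather than with the previous user's last count.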
Solution II
A double `for` loop that sets the values row by row.
    DT[, numHomePagesViewed := 0L][, numContentPagesViewed := 0L]
    for (i in unique(DT$userId)) {
      # Counters start at -1 so that the first view of a type gets 0.
      home_inc <- -1L
      content_inc <- -1L
      for (j in 1L:nrow(DT[userId==i])) {
        # The row index assumes integer userIds 1, 2, ... with exactly
        # 10 rows each, as in the sample data.
        if (DT$pageType[(i-1L)*10L + j] == "home") {
          home_inc <- home_inc + 1L
          set(DT, (i-1L)*10L + j, "numHomePagesViewed", home_inc)
        } else {
          set(DT, (i-1L)*10L + j, "numHomePagesViewed", max(0L, home_inc))
        }
        if (DT$pageType[(i-1L)*10L + j] == "content") {
          content_inc <- content_inc + 1L
          set(DT, (i-1L)*10L + j, "numContentPagesViewed", content_inc)
        } else {
          set(DT, (i-1L)*10L + j, "numContentPagesViewed", max(0L, content_inc))
        }
      }
    }
    > DT
        userId pageType numHomePagesViewed numContentPagesViewed
     1:      1     home                  0                     0
     2:      1  content                  0                     0
     3:      1     home                  1                     0
     4:      1  content                  1                     1
     5:      1     home                  2                     1
     6:      1     home                  3                     1
     7:      1  content                  3                     2
     8:      1  content                  3                     3
     9:      1     home                  4                     3
    10:      1     home                  5                     3
    ...
[Question]
- What can I do to improve the speed?
- Is there a more "`data.table`" way to solve this problem?
Would you mind updating your answer? As the comments say, the names can be specified in one line, which does the naming in one go: `DT[, c("numHomePagesViewed", "numContentPagesViewed") := lapply(unique(pageType), function(x) pmax(cumsum(pageType == x) - 1, 0)), by = userId]` – Boxuan
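The one-liner in the comment can be written out as a runnable sketch. One assumption differs from the comment: the page types and column names are listed explicitly (`types` and `cols` are illustrative names), because `unique(pageType)` evaluated inside `by = userId` follows each user's own order of first appearance, so a user whose first page is `content` would get the two columns swapped.

```r
library(data.table)

# Sample data from the question.
DT <- data.table(
  userId = rep(1:3, each = 10),
  pageType = c("home", "content", "home", "content", "home", "home", "content", "content", "home", "home",
               "content", "content", "home", "home", "content", "home", "home", "content", "home", "content",
               "home", "home", "content", "content", "home", "home", "content", "content", "home", "content")
)

# Assumed fixed ordering: types[k] is counted into cols[k].
types <- c("home", "content")
cols  <- c("numHomePagesViewed", "numContentPagesViewed")

# Per user: running count of each type minus one, floored at 0 so that
# rows before the first view of a type show 0.
DT[, (cols) := lapply(types, function(x) pmax(cumsum(pageType == x) - 1L, 0L)),
   by = userId]

print(DT[userId == 1])
```

With 8 page types, `types` and `cols` would each have 8 entries; `cols` could be built with `paste0()` under whatever naming scheme is preferred.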