Count new values that did not occur earlier and values that did not occur in the previous group

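I am counting the number of unique "new" users per month. A new user is one who has not appeared before (since the start of the data). I am also counting the number of unique users who did not appear in the previous month.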

The raw data looks like this:

library(dplyr) 
    date <- c("2010-01-10","2010-02-13","2010-03-22","2010-01-11","2010-02-14","2010-03-23","2010-01-12","2010-02-14","2010-03-24") 
    mth <- rep(c("2010-01","2010-02","2010-03"),3) 
    user <- c("123","129","145","123","129","180","180","184","145") 

    dt <- data.frame(date,mth,user) 

    dt <- dt %>% arrange(date) 

    dt 

        date     mth user
1 2010-01-10 2010-01  123
2 2010-01-11 2010-01  123
3 2010-01-12 2010-01  180
4 2010-02-13 2010-02  129
5 2010-02-14 2010-02  129
6 2010-02-14 2010-02  184
7 2010-03-22 2010-03  145
8 2010-03-23 2010-03  180
9 2010-03-24 2010-03  145

The answer should look like this:

new <- c(2,2,2,2,2,2,1,1,1) 
    totNew <- c(2,2,2,4,4,4,5,5,5) 
    notLastMonth <- c(2,2,2,2,2,2,2,2,2) 

    tmp <- cbind(dt,new,totNew,notLastMonth) 
    tmp 

        date     mth user new totNew notLastMonth
1 2010-01-10 2010-01  123   2      2            2
2 2010-01-11 2010-01  123   2      2            2
3 2010-01-12 2010-01  180   2      2            2
4 2010-02-13 2010-02  129   2      4            2
5 2010-02-14 2010-02  129   2      4            2
6 2010-02-14 2010-02  184   2      4            2
7 2010-03-22 2010-03  145   1      5            2
8 2010-03-23 2010-03  180   1      5            2
9 2010-03-24 2010-03  145   1      5            2

Source

2017-01-09 user3482393

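Is there a reason you want the totals for new, totNew and notLastMonth to be in that "user" table? It seems odd to store them in the user records. Getting the new customers is simple, though: group by user, then mutate a new column that holds the first month in which they appear. Then group by the new column and count users. – Shape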
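The approach described in this comment can be sketched roughly as follows. This is only one reading of the comment, not code from its author: the column name `first_mth` and the result name `new_per_month` are made up for illustration, and the sketch relies on "YYYY-MM" strings sorting correctly as text.

```r
library(dplyr)

# Same months and users as in the question, stored as character (not factor)
dt <- data.frame(
  mth  = rep(c("2010-01", "2010-02", "2010-03"), 3),
  user = c("123", "129", "145", "123", "129", "180", "180", "184", "145"),
  stringsAsFactors = FALSE
)

new_per_month <- dt %>%
  group_by(user) %>%
  mutate(first_mth = min(mth)) %>%   # first month in which each user appears
  ungroup() %>%
  distinct(user, first_mth) %>%      # one row per user
  count(first_mth, name = "new")     # users per first month = new users

new_per_month
```

Each user contributes exactly one row after `distinct()`, so the count per `first_mth` is the number of new users in that month.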

Here's an attempt (explanations are in the code comments):

dt %>% 
    group_by(user) %>% 
    mutate(Count = row_number()) %>% # Count appearances per user 
    group_by(mth) %>% 
    mutate(new = sum(Count == 1)) %>% # Count first appearances per months 
    summarise(new = first(new), # Summarise new users per month (for cumsum) 
      users = list(unique(user))) %>% # Create a list of unique users per month (for notLastMonth) 
    mutate(totNew = cumsum(new), # Calculate overall cumulative sum of unique users 
     notLastMonth = lengths(Map(setdiff, users, lag(users)))) %>% # Compare new users to previous month 
    select(-users) %>% 
    right_join(dt) # Join back to the real data 

# A tibble: 9 × 6 
#  mth new totNew notLastMonth  date user 
# <fctr> <int> <int>  <int>  <fctr> <fctr> 
# 1 2010-01  2  2   2 2010-01-10 123 
# 2 2010-01  2  2   2 2010-01-11 123 
# 3 2010-01  2  2   2 2010-01-12 180 
# 4 2010-02  2  4   2 2010-02-13 129 
# 5 2010-02  2  4   2 2010-02-14 129 
# 6 2010-02  2  4   2 2010-02-14 184 
# 7 2010-03  1  5   2 2010-03-22 145 
# 8 2010-03  1  5   2 2010-03-23 180 
# 9 2010-03  1  5   2 2010-03-24 145

Source

2017-01-09 22:42:38

This works as advertised. Heavy use of dplyr, grouping and multiple mutates; clever. Thanks a lot! I had not seen notLastMonth = lengths(Map(setdiff, users, lag(users))) used before. – user3482393
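For readers who have not met the `lengths(Map(setdiff, ...))` idiom either, here is a minimal base R sketch of what it does. It is an illustration, not the answer's code: the list `users` is typed in by hand, and `prev` is a hand-built stand-in for the lagged list (the answer uses `lag(users)` instead).

```r
# Unique users per month, in month order (taken from the question's data)
users <- list(c("123", "180"), c("129", "184"), c("145", "180"))

# Previous month's users for each month; the first month has no predecessor,
# so it is paired with an empty set (assumption standing in for lag(users))
prev <- c(list(character(0)), head(users, -1))

# Map() walks both lists in parallel; setdiff() keeps users absent last month;
# lengths() counts the survivors per month
not_last_month <- lengths(Map(setdiff, users, prev))
not_last_month
```

`Map()` pairs the i-th element of `users` with the i-th element of `prev`, which is why shifting the list by one position is enough to compare each month with its predecessor.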

Here's another idea, starting from a tabulation of "user" per "mth":

table(dt[c("user", "mth")]) > 0L

Assuming that this path will most likely lead to memory problems, we can start with a sparse alternative instead:

library(Matrix) 
tab = as(xtabs(~ user + mth, dt, sparse = TRUE) > 0L, "TsparseMatrix") 
tab 
#5 x 3 sparse Matrix of class "lgTMatrix" 
# 2010-01 2010-02 2010-03 
#123  |  .  . 
#129  .  |  . 
#145  .  .  | 
#180  |  .  | 
#184  .  |  .

Then, we get the "mth" (as a column index) in which each "user" first appeared:

tapply(tab@j, rownames(tab)[tab@i + 1L], min) + 1L 
#123 129 145 180 184 
# 1 2 3 1 2
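The `+ 1L` terms are there because the triplet form of a sparse matrix stores zero-based row and column positions in its `i` and `j` slots. A small sketch on a toy matrix (the matrix `m` and its dimnames are hypothetical, chosen only for illustration) shows the same first-column lookup:

```r
library(Matrix)

# Hypothetical toy matrix: user "a" is present in m2, user "b" in m1 and m3
m <- sparseMatrix(i = c(1, 2, 2), j = c(2, 1, 3), x = c(1, 1, 1),
                  dimnames = list(c("a", "b"), c("m1", "m2", "m3")))
m <- as(m, "TsparseMatrix")

# m@i and m@j are zero-based positions of the non-zero cells,
# so 1 is added to index rownames and to report a 1-based column
first_col <- tapply(m@j, rownames(m)[m@i + 1L], min) + 1L
first_col
```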

We can find the number of new entries per "mth":

new = setNames(tabulate(tapply(tab@j, rownames(tab)[tab@i + 1L], min) + 1L, 
         ncol(tab)), 
       colnames(tab)) 
new 
#2010-01 2010-02 2010-03 
#  2  2  1

And the cumulative sum of new entries:

totNew = cumsum(new) 
totNew 
#2010-01 2010-02 2010-03 
#  2  4  5

And, subtracting the number of "user"s per "mth" that are present both in that "mth" and in the previous one:

setNames(colSums(cbind(FALSE, tab[, -ncol(tab)]) & tab), colnames(tab)) 
#2010-01 2010-02 2010-03 
#  0  0  0

from the number of users per month:

colSums(tab) 
#2010-01 2010-02 2010-03 
#  2  2  2

we get:

notLast = colSums(tab) - colSums(cbind(FALSE, tab[, -ncol(tab)]) & tab) 
notLast 
#2010-01 2010-02 2010-03 
#  2  2  2

One way to arrive at the desired output could be:

merge(dt, data.frame(mth = names(new), new, totNew, notLast), by = "mth") 
#  mth  date user new totNew notLast 
#1 2010-01 2010-01-10 123 2  2  2 
#2 2010-01 2010-01-11 123 2  2  2 
#3 2010-01 2010-01-12 180 2  2  2 
#4 2010-02 2010-02-13 129 2  4  2 
#5 2010-02 2010-02-14 129 2  4  2 
#6 2010-02 2010-02-14 184 2  4  2 
#7 2010-03 2010-03-22 145 1  5  2 
#8 2010-03 2010-03-23 180 1  5  2 
#9 2010-03 2010-03-24 145 1  5  2

Source

2017-01-10 11:58:58

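Thank you for the great effort. All these solutions show that R is a highly flexible (and therefore often confusing) programming language. Looking at these solutions, you could hardly tell that they come from the same programming language unless you are very familiar with all the packages and extensions R offers. – user3482393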

Here is a pure base R solution. It works best when the variables are not factors, and it assumes the data are sorted by month.

# get list of active monthly users 
activeUsers <- lapply(unique(dt$mth), function(i) unique(dt[dt$mth==i, "user"])) 
# get accumulating list of all users 
allUsers <- Reduce(union, activeUsers, accumulate=TRUE)

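Now the users of each month are stored in activeUsers, and a growing list of all users seen up to and including each month is stored in allUsers. With this information, we can easily calculate the first two variables.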

# get the calculations 
totNew <- lengths(allUsers) 
new <- c(totNew[1], diff(totNew)) 
notLastMonth <- c(totNew[1], lengths(lapply(seq_along(activeUsers)[-1], 
           function(i) setdiff(activeUsers[[i]], activeUsers[[i-1]]))))

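The lengths function efficiently calculates the length of each list item. The second line uses diff to calculate the number of new users. Both the second and third lines use the totNew variable to prepend the initial value (2). The third line is a bit more involved: it uses setdiff and lapply to construct the set of active users who were not present in the previous month. lengths is then used again for the count.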
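To see what the accumulating `Reduce` call contributes, here is a small self-contained sketch. The list `activeUsers` is typed in by hand to match the question's data rather than computed from `dt`:

```r
# Unique users per month, in month order (hand-typed from the question's data)
activeUsers <- list(c("123", "180"), c("129", "184"), c("145", "180"))

# accumulate = TRUE keeps every intermediate union, so element k holds
# all users seen in months 1..k
allUsers <- Reduce(union, activeUsers, accumulate = TRUE)

totNew <- lengths(allUsers)            # running count of distinct users
new    <- c(totNew[1], diff(totNew))   # growth of that count per month
totNew
new
```

User 180 is already in the accumulated set when month three arrives, so the third union grows by one user only.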

#merge on to data set 
merge(dt, data.frame(mth=unique(dt$mth), new=new, totNew=totNew, notLastMonth=notLastMonth), 
     by="mth") 

     mth  date user new totNew notLastMonth 
1 2010-01 2010-01-10 123 2  2   2 
2 2010-01 2010-01-12 180 2  2   2 
3 2010-01 2010-01-11 123 2  2   2 
4 2010-02 2010-02-13 129 2  4   2 
5 2010-02 2010-02-14 129 2  4   2 
6 2010-02 2010-02-14 184 2  4   2 
7 2010-03 2010-03-23 180 1  5   2 
8 2010-03 2010-03-22 145 1  5   2 
9 2010-03 2010-03-24 145 1  5   2

data

dt <- data.frame(date,mth,user, stringsAsFactors=FALSE)

Source

2017-01-10 14:12:38 lmo

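This also works on my much larger data set. Interesting use of functions and lists. I have been using dplyr extensively to summarise information, but this is certainly a good way to get the same result with base R. Thanks. – user3482393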

Since no one has posted it yet, this is my preferred way:

library(zoo) 
dt <- dt %>% mutate(ym = as.yearmon(mth)) 

ct_dt = dt %>% distinct(user, ym) %>% arrange(user, ym) %>% 
    group_by(user) %>% mutate(last_ym = dplyr::lag(ym)) %>% 
    group_by(ym) %>% summarise(
    new   = sum(is.na(last_ym)), 
    not_last_ym = sum(is.na(last_ym) | 12*(ym - last_ym) > 1) 
) 

# # A tibble: 3 x 3 
#    ym new not_last_ym 
# <S3: yearmon> <int>  <int> 
# 1  Jan 2010  2   2 
# 2  Feb 2010  2   2 
# 3  Mar 2010  1   2

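From here, you can take the cumsum of new if you really want a totNew column; and if you really want to see this data (confusingly) expanded across multiple rows, you can left_join ct_dt with dt.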
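Those two follow-up steps might look something like the sketch below. To keep it self-contained, `ct_dt` is rebuilt by hand from the summary printed above and keyed by the month string `mth` instead of the yearmon column `ym`; both choices are assumptions made for illustration.

```r
library(dplyr)

# Hand-built stand-in for ct_dt, keyed by "mth" rather than "ym" (assumption)
ct_dt <- data.frame(
  mth         = c("2010-01", "2010-02", "2010-03"),
  new         = c(2L, 2L, 1L),
  not_last_ym = c(2L, 2L, 2L),
  stringsAsFactors = FALSE
)

dt <- data.frame(
  mth  = rep(c("2010-01", "2010-02", "2010-03"), 3),
  user = c("123", "129", "145", "123", "129", "180", "180", "184", "145"),
  stringsAsFactors = FALSE
)

# Running total of new users, then the monthly summary repeated on every row
ct_dt <- ct_dt %>% mutate(totNew = cumsum(new))
res   <- left_join(dt, ct_dt, by = "mth")
res
```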

Or with data.table ...

library(zoo) 
library(data.table) 
setDT(dt) 

dt[, ym := as.yearmon(mth)] 

ct_dt = setorder(unique(dt[, .(user, ym)]))[, 
    last_ym := shift(ym) 
, by=user][, .(
    new   = sum(is.na(last_ym)), 
    not_last_ym = sum(is.na(last_ym) | 12*(ym - last_ym) > 1) 
), by=ym]
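One thing that may be worth checking with this comparison: zoo stores a yearmon value as the year plus a fraction (0 for January, 1/12 for February, and so on), so `12*(ym - last_ym)` is floating-point arithmetic and need not come out as exactly 1 for consecutive months. The base R sketch below uses plain numbers as a stand-in for yearmon values (an assumption) and rounds the gap before comparing. Whether rounding matters for any particular data set is not established here.

```r
# Twelve consecutive months of 2010, encoded as year + (month - 1) / 12
ym      <- 2010 + (0:11) / 12
last_ym <- c(NA, head(ym, -1))         # previous month, NA for the first one

# Round the gap to whole months before testing for "more than one month ago"
gap      <- round(12 * (ym - last_ym))
not_last <- is.na(last_ym) | gap > 1
not_last
```

With rounding, only the first month (which has no predecessor) is flagged; every later month is exactly one month after the previous one.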

Source

2017-01-10 20:08:53 Frank

I tried this on my large data set. I got the correct result for "new" but not for "not_last_ym"; I believe it only needs a small change. Thanks. – user3482393

@user3482393 If you can add a small example illustrating the problem to your initial post, maybe I can figure it out. – Frank
