幫我用一個「應用」函數替換一個for循環

我的任務是找到用戶參與遊戲的最長連續日子。

我選擇使用R的rle函數來獲取最長的條紋，然後用結果更新我的數據庫表，而不是編寫一個sql函數。

的（附後）據幀是這樣的：

day  user_id 
2008/11/01 2001 
2008/11/01 2002 
2008/11/01 2003 
2008/11/01 2004 
2008/11/01 2005 
2008/11/02 2001 
2008/11/02 2005 
2008/11/03 2001 
2008/11/03 2003 
2008/11/03 2004 
2008/11/03 2005 
2008/11/04 2001 
2008/11/04 2003 
2008/11/04 2004 
2008/11/04 2005

我嘗試以下每用戶最長連勝獲得

# turn it to a contingency table 
my_table <- table(user_id, day) 

# get the streaks 
rle_table <- apply(my_table,1,rle) 

# verify the longest streak of "1"s for user 2001 
# as.vector(tapply(rle_table$'2001'$lengths, rle_table$'2001'$values, max)["1"]) 

# loop to get the results 
# initiate results matrix 
res<-matrix(nrow=dim(my_table)[1], ncol=2) 

for (i in 1:dim(my_table)[1]) { 
string <- paste("as.vector(tapply(rle_table$'", rownames(my_table)[i], "'$lengths, rle_table$'", rownames(my_table)[i], "'$values, max)['1'])", sep="") 
res[i,]<-c(as.integer(rownames(my_table)[i]) , eval(parse(text=string))) 
}

不幸的是這個for循環的時間太長而我想，如果有一種方法可以使用「應用」系列中的函數生成res矩陣。

預先感謝您

來源

2009-10-01 George Dontas

如果你要在SQL中做（或R中sqldf）有這等一個很好的討論，讓線程的http：/ /stackoverflow.com/questions/1176011/sql-to-determine-minimum-sequential-days-of-access/1176255#1176255 – 2009-10-01 16:25:29

你爲什麼使用該粘貼/評估計劃？看起來這會給你帶來巨大的性能打擊？ – 2009-10-01 16:53:23

我同意喬納森;這也使得它很難閱讀。那是你想要做什麼？ res.1 <-matrix（nrow = dim（my.table）[1]，ncol = 2） for（i in 1：dim（my.table）[1]）{ ] < - c（as.integer（rownames（my.table）[i]），sum（as.table（my.table）[i，]））} – Vince 2009-10-01 17:44:26

另一種選擇

# convert to Date 
day_table$day <- as.Date(day_table$day, format="%Y/%m/%d") 
# split by user and then look for contiguous days 
contig <- sapply(split(day_table$day, day_table$user_id), function(.days){ 
    .diff <- cumsum(c(TRUE, diff(.days) != 1)) 
    max(table(.diff)) 
})

來源

2010-01-12 07:34:25

的apply功能並不總是（甚至一般）比for循環更快。這是R與S-Plus相關聯的剩餘部分（在後者中，應用比用於更快）。 lapply是一個例外，它通常比for快（因爲它使用C代碼）。 See this related question。

所以你應該使用apply主要是爲了提高代碼的清晰度，而不是提高性能。

您可能需要find Dirk's presentation on high-performance computing useful。另一個蠻力方法是"just-in-time compilation" with Ra instead of the normal R version，它被優化以處理for循環。

[編輯：]顯然有很多方法可以實現這一點，即使它更緊湊，這絕不是更好。只要你的代碼的工作，這裏的另一種方法：

dt <- data.frame(table(dat))[,2:3] 
dt.b <- by(dt[,2], dt[,1], rle) 
t(data.frame(lapply(dt.b, function(x) max(x$length))))

你可能會需要一點點進一步控制輸出。

來源

2009-10-01 16:06:48 Shane

編輯：修正。我原本以爲我將不得不修改大部分rle（），但事實證明只需要一些調整。

這不是關於* apply方法的答案，但我想知道這是否可能不是整個過程的更快方法。正如Shane所說，循環並不是那麼糟糕。而且......我很少會向任何人展示我的代碼，所以我很樂意聽到對此的批評。

#Shane, I told you this was awesome 
dat <- getSOTable("http://stackoverflow.com/questions/1504832/help-me-replace-a-for-loop-with-an-apply-function", 1) 
colnames(dat) <- c("day", "user_id") 
#Convert to dates so that arithmetic works properly on them 
dat$day <- as.Date(dat$day) 

#Custom rle for dates 
rle.date <- function (x) 
{ 
    #Accept only dates 
    if (class(x) != "Date") 
     stop("'x' must be an object of class \"Date\"") 
    n <- length(x) 
    if (n == 0L) 
     return(list(lengths = integer(0L), values = x)) 
    #Dates need to be sorted 
    x.sort <- sort(x) 
    #y is a vector indicating at which indices the date is not consecutive with its predecessor 
    y <- x.sort[-1L] != (x.sort + 1)[-n] 
    #i returns the indices of y that are TRUE, and appends the index of the last value 
    i <- c(which(y | is.na(y)), n) 
    #diff tells you the distances in between TRUE/non-consecutive dates. max gets the largest of these. 
    max(diff(c(0L, i))) 
} 

#Loop 
max.consec.use <- matrix(nrow = length(unique(dat$user_id)), ncol = 1) 
rownames(max.consec.use) <- unique(dat$user_id) 

for(i in 1:length(unique(dat$user_id))){ 
    user <- unique(dat$user_id)[i] 
    uses <- subset(dat, user_id %in% user) 
    max.consec.use[paste(user), 1] <- rle.date(uses$day) 
} 

max.consec.use

來源

2009-10-01 19:40:00

忘記添加：getSOTable函數來自Shane的答案在這裏：http://stackoverflow.com/questions/1434897/how-do-i-load-example-datasets-in-r/1434927#1434927 – 2009-10-01 19:53:03

哦，這是甜的。謝謝Shane。 – kpierce8 2009-10-01 20:05:57

如果你有一個非常長的數據列表，而不是聽起來可能是一個羣集問題。每個羣集將由用戶定義並且最大間隔距離爲1的日期。然後由用戶檢索最大的羣集。如果我想到特定的方法，我會編輯它。

來源

2009-10-01 19:48:03 kpierce8

這是Chris's suggestion for how to get the data：

dat <- read.table(textConnection(
"day  user_id 
2008/11/01 2001 
2008/11/01 2002 
2008/11/01 2003 
2008/11/01 2004 
2008/11/01 2005 
2008/11/02 2001 
2008/11/02 2005 
2008/11/03 2001 
2008/11/03 2003 
2008/11/03 2004 
2008/11/03 2005 
2008/11/04 2001 
2008/11/04 2003 
2008/11/04 2004 
2008/11/04 2005 
"), header=TRUE)

來源

2009-10-01 20:01:03 Shane

...是的，這可能會更明智一些。但我不時在編程中喜歡一點魔力。 – 2009-10-01 20:18:08

幫我用一個「應用」函數替換一個for循環

回答

相關問題