2013-06-18 28 views
5

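How to calculate the number of occurrences per minute for a large dataset

I have a dataset of about 500k appointments lasting between 5 and 60 minutes.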

tdata <- structure(list(Start = structure(c(1325493000, 1325493600, 1325494200, 1325494800, 1325494800, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325497500, 1325497500, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300), class = c("POSIXct", "POSIXt"), tzone = "GMT"), End = structure(c(1325493600, 1325494200, 1325494500, 1325495400, 1325495400, 1325496000, 1325496000, 1325496600, 1325496600, 1325496900, 1325496900, 1325498100, 1325498100, 1325498400, 1325498700, 1325498700, 1325499000, 1325499300, 1325499600, 1325499600), class = c("POSIXct", "POSIXt"), tzone = "GMT"), Location = c("LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationA", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB", "LocationB"), Room = c("RoomA", "RoomA", "RoomA", "RoomA", "RoomB", "RoomB", "RoomB", "RoomB", "RoomB", "RoomB", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA", "RoomA")), .Names = c("Start", "End", "Location", "Room"), row.names = c(NA, 20L), class = "data.frame") 
> head(tdata) 
       Start     End Location Room 
1 2012-01-02 08:30:00 2012-01-02 08:40:00 LocationA RoomA 
2 2012-01-02 08:40:00 2012-01-02 08:50:00 LocationA RoomA 
3 2012-01-02 08:50:00 2012-01-02 08:55:00 LocationA RoomA 
4 2012-01-02 09:00:00 2012-01-02 09:10:00 LocationA RoomA 
5 2012-01-02 09:00:00 2012-01-02 09:10:00 LocationA RoomB 
6 2012-01-02 09:10:00 2012-01-02 09:20:00 LocationA RoomB 

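I want to calculate the total number of concurrent appointments, per location and per room (and for a few other factors in the original dataset).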

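I have tried performing a left join in SQL (using the sqldf package), which works for a small dataset but takes forever on the entire dataset: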

# SQL Join. 
start.min <- min(tdata$Start, na.rm=T) 
end.max <- max(tdata$End, na.rm=T) 
tinterval <- seq.POSIXt(start.min, end.max, by = "mins") 
tinterval <- as.data.frame(tinterval) 

library(sqldf) 
system.time(
    output <- sqldf("SELECT * 
       FROM tinterval 
       LEFT JOIN tdata 
       ON tinterval.tinterval >= tdata.Start 
       AND tinterval.tinterval < tdata.End ")) 

head(output) 
      tinterval    Start     End Location Room 
1 2012-01-02 09:30:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
2 2012-01-02 09:31:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
3 2012-01-02 09:32:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
4 2012-01-02 09:33:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
5 2012-01-02 09:34:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 
6 2012-01-02 09:35:00 2012-01-02 09:30:00 2012-01-02 09:40:00 LocationA RoomA 

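This creates a data frame in which all "active" appointments are listed for each minute. The large dataset covers a full year (about 525,600 minutes). With an average appointment duration of 18 minutes, I expect the SQL join to produce a dataset of about 5 million rows, which I can use to create occupancy plots for the different factors (location, room, etc.).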

Building on the sapply solution suggested in "How to count number of concurrent users", I tried using data.table and snowfall as follows:

require(snowfall) 
require(data.table) 
sfInit(par=T, cpu=4) 
sfLibrary(data.table) 

tdata <- data.table(tdata) 
tinterval <- seq.POSIXt(start.min, end.max, by = "mins") 
setkey(tdata, Start, End) 
sfExport("tdata") # "Transport" data to cores 

system.time(output <- data.frame(tinterval,sfSapply(tinterval, function(i) length(tdata[Start <= i & i < End,Start])))) 

> head(output) 
      tinterval sfSapply.tinterval..function.i..length.tdata.Start....i...i... 
1 2012-01-02 08:30:00                1 
2 2012-01-02 08:31:00                1 
3 2012-01-02 08:32:00                1 
4 2012-01-02 08:33:00                1 
5 2012-01-02 08:34:00                1 
6 2012-01-02 08:35:00                1 

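This solution is fast, taking about 18 seconds to compute 1 day (about 2 hours for a full year). The downside is that I cannot break the number of concurrent appointments down by certain factors (location, room, etc.). I have the feeling there must be a better way to do this. Any suggestions?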

UPDATE: Based on Geoffrey's answer, the final solution looks like this. The example shows how the occupancy per location is determined.

setkey(tdata, Location, Start, End) 
# Rows are now sorted by Location first, so take the global min/max rather than the first/last row.
vecTime <- seq(from=min(tdata$Start),to=max(tdata$End),by=60) 
res <- data.frame(time=vecTime) 

for(i in 1:length(unique(tdata$Location))) { 
    addz <- array(0,length(vecTime)) 
    remz <- array(0,length(vecTime)) 

    tdata2 <- tdata[J(unique(tdata$Location)[i]),] # Subset a certain location. 

    startAgg <- aggregate(tdata2$Start,by=list(tdata2$Start),length) 
    endAgg <- aggregate(tdata2$End,by=list(tdata2$End),length) 
    addz[which(vecTime %in% startAgg$Group.1)] <- startAgg$x 
    remz[which(vecTime %in% endAgg$Group.1)] <- -endAgg$x 

    res[,c(unique(tdata$Location)[i])] <- cumsum(addz + remz) 
} 

> head(res) 
       time LocationA LocationB 
1 2012-01-01 03:30:00   1   0 
2 2012-01-01 03:31:00   1   0 
3 2012-01-01 03:32:00   1   0 
4 2012-01-01 03:33:00   1   0 
5 2012-01-01 03:34:00   1   0 
6 2012-01-01 03:35:00   1   0 
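
The per-location loop can in principle be extended to any combination of factors. The following is only a sketch in base R, not the code used on the real dataset: the helper name `occupancy_per_minute`, the toy data and the `Location.Room` column names are all assumptions made for illustration. It also assumes every start and end time falls exactly on a whole minute of the grid.

```r
# Hypothetical helper: per-minute occupancy for one group of appointments.
# start/end are POSIXct vectors; grid is the full per-minute POSIXct sequence.
occupancy_per_minute <- function(start, end, grid) {
  n <- length(grid)
  # +1 for every appointment starting in a slot, -1 for every one ending there.
  delta <- tabulate(match(as.numeric(start), as.numeric(grid)), nbins = n) -
           tabulate(match(as.numeric(end),   as.numeric(grid)), nbins = n)
  cumsum(delta)  # running total = appointments started but not yet ended
}

# Made-up appointments (offsets in minutes from 08:30 GMT).
origin <- as.POSIXct("2012-01-02 08:30:00", tz = "GMT")
toy <- data.frame(
  Start    = origin + 60 * c(0, 5, 5, 10),
  End      = origin + 60 * c(10, 15, 8, 20),
  Location = c("LocationA", "LocationA", "LocationB", "LocationB"),
  Room     = c("RoomA", "RoomB", "RoomA", "RoomA"),
  stringsAsFactors = FALSE
)

# One shared minute grid, so all groups line up row by row.
grid <- seq(min(toy$Start), max(toy$End), by = 60)

# One occupancy column per Location/Room combination that actually occurs.
groups <- split(toy, list(toy$Location, toy$Room), drop = TRUE)
occ <- sapply(groups, function(g) occupancy_per_minute(g$Start, g$End, grid))
res <- data.frame(time = grid, occ, check.names = FALSE)
```

Because every group is evaluated on the same grid, the resulting columns can be summed to get totals per location or overall.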
+0

Glad a useful answer was provided. Just a pointer. – Arun

Answers

3

Does this work better?

Create an empty time vector and empty count vectors.

vecTime <- seq(from=tdata$Start[1],to=tdata$End[nrow(tdata)],by=60) 
addz <- array(0,length(vecTime)) 
remz <- array(0,length(vecTime)) 


startAgg <- aggregate(tdata$Start,by=list(tdata$Start),length) 
endAgg <- aggregate(tdata$End,by=list(tdata$End),length) 
addz[which(vecTime %in% startAgg$Group.1)] <- startAgg$x 
remz[which(vecTime %in% endAgg$Group.1)] <- -endAgg$x 
res <- data.frame(time=vecTime,occupancy=cumsum(addz + remz)) 
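
To see why the cumulative sum gives the number of active appointments, and not just the number of starts, here is a minimal sketch on made-up data (times are plain minutes rather than POSIXct, and all variable names are illustrative). It compares the running total with the direct per-minute count used in the question.

```r
# Three toy appointments, in minutes since an arbitrary origin (made-up data).
start <- c(0, 5, 5)
end   <- c(10, 15, 8)

# One slot per minute from the first start to the last end.
minutes <- seq(min(start), max(end), by = 1)

# +1 at every start minute, -1 at every end minute.
# tabulate() counts events per slot, so several events in one minute are handled.
delta <- tabulate(match(start, minutes), nbins = length(minutes)) -
         tabulate(match(end,   minutes), nbins = length(minutes))

# Running total = appointments that have started but not yet ended.
occupancy <- cumsum(delta)

# Cross-check against the direct (slow) definition: Start <= t < End.
brute <- sapply(minutes, function(m) sum(start <= m & m < end))
stopifnot(all(occupancy == brute))
```

Each appointment contributes +1 from its start minute onward and -1 from its end minute onward, so at any minute the running total counts exactly the appointments with Start <= t < End.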
+0

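Thanks a lot, Geoffrey, but this does not cover the number of appointments that are active during a given period. It tells me that two appointments start at 9:00, but it does not account for active appointments (started but not yet ended). I need the occupancy per minute to study the peaks during really busy periods. – TimV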

+0

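Hi Geoffrey, your solution took 9 seconds to process my entire dataset, after I had been struggling for hours. Many thanks for your input. I had been looking in the wrong direction: aggregating the start and end times of all appointments and deriving the occupancy from those is really great. Given the speed of the calculation, I can build occupancy plots per location or per room with a few for loops, so I consider my question answered. – TimV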

0

I'm not entirely sure I understand your goal. Nevertheless, this might be useful:

#I changed the example to actually have concurrent appointments 
DF <- read.table(text="    Start,     End, Location, Room 
1, 2012-01-02 08:30:00, 2012-01-02 08:40:00, LocationA, RoomA 
2, 2012-01-02 08:40:00, 2012-01-02 08:50:00, LocationA, RoomA 
3, 2012-01-02 08:50:00, 2012-01-02 09:55:00, LocationA, RoomA 
4, 2012-01-02 09:00:00, 2012-01-02 09:10:00, LocationA, RoomA 
5, 2012-01-02 09:00:00, 2012-01-02 09:10:00, LocationA, RoomB 
6, 2012-01-02 09:10:00, 2012-01-02 09:20:00, LocationA, RoomB",header=TRUE,sep=",",stringsAsFactors=FALSE) 

# The timestamps are year-month-day, so the format must be %Y-%m-%d.
DF$Start <- as.POSIXct(DF$Start,format="%Y-%m-%d %H:%M:%S",tz="GMT") 
DF$End <- as.POSIXct(DF$End,format="%Y-%m-%d %H:%M:%S",tz="GMT") 

library(data.table) 
DT <- data.table(DF) 
DT[,c("Start_num","End_num"):=lapply(.SD,as.numeric),.SDcols=1:2] 

fun <- function(s,e) { 
    require(intervals) 
    mat <- cbind(s,e) 
    inter <- Intervals(mat,closed=c(FALSE,FALSE),type="R") 
    io <- interval_overlap(inter, inter) 
    tablengths <- table(sapply(io,length))[-1] 
    sum(c(0,as.vector(tablengths/as.integer(names(tablengths))))) 
} 

#number of overlapping events per room and location 
DT[,fun(Start_num,End_num),by=list(Location,Room)] 
#  Location Room V1 
#1: LocationA RoomA 1 
#2: LocationA RoomB 0 

I haven't tested this, in particular not for speed.

+0

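Thanks, Roland. Interesting approach, but I was looking for the total occupancy per minute, with the ability to break it down by location and room. – TimV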

0

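Here is a strategy: order by start time, then unlist the data as start, end, start, end, ... and check whether that vector needs to be reordered. If it does not, there are no conflicts; if it does, you can see how many appointments (and which ones, if you like) conflict with each other.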

# Using Roland's example: 
DF <- read.table(text="    Start,     End, Location, Room 
1,2012-01-02 08:30:00,2012-01-02 08:40:00,LocationA,RoomA 
2,2012-01-02 08:40:00,2012-01-02 08:50:00,LocationA,RoomA 
3,2012-01-02 08:50:00,2012-01-02 09:55:00,LocationA,RoomA 
4,2012-01-02 09:00:00,2012-01-02 09:10:00,LocationA,RoomA 
5,2012-01-02 09:00:00,2012-01-02 09:10:00,LocationA,RoomB 
6,2012-01-02 09:10:00,2012-01-02 09:20:00,LocationA,RoomB",header=TRUE,sep=",",stringsAsFactors=FALSE) 

library(data.table) 
dt = data.table(DF) 

# the conflicting appointments 
dt[order(Start), 
    .SD[unique((which(order(c(rbind(Start, End))) != 1:(2*.N)) - 1) %/% 2 + 1)], 
    by = list(Location, Room)] 
# Location Room    Start     End 
#1: LocationA RoomA 2012-01-02 08:50:00 2012-01-02 09:55:00 
#2: LocationA RoomA 2012-01-02 09:00:00 2012-01-02 09:10:00 

# and a speedier version of the above, that avoids constructing the full .SD: 
dt[dt[order(Start), 
     .I[unique((which(order(c(rbind(Start, End))) != 1:(2*.N)) - 1) %/% 2 + 1)], 
     by = list(Location, Room)]$V1] 

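Perhaps the formula above for going from the mismatched order to the correct indices can be simplified. I didn't spend much time thinking about it and just used the first thing that got the job done.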
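
As a small worked example of the reordering idea, the sketch below uses made-up numeric times for a single room, already sorted by start time; the variable names are illustrative only. It relies on `order()` leaving ties in their original order, so that an appointment starting exactly when the previous one ends does not count as a conflict.

```r
# Made-up appointments for one room, sorted by start time (minutes).
start <- c(0, 10, 20, 30)
end   <- c(10, 20, 65, 40)

# Interleave as start1, end1, start2, end2, ...
timeline <- c(rbind(start, end))

# Positions where sorting the timeline would move an element.
ord   <- order(timeline)
moved <- which(ord != seq_along(timeline))

# Positions 2k-1 and 2k of the interleaved vector belong to appointment k.
conflicting <- unique((moved - 1) %/% 2 + 1)
```

Here appointment 3 (20 to 65) overlaps appointment 4 (30 to 40), so `conflicting` contains 3 and 4, while the back-to-back appointments 1 and 2 leave the timeline in order.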