2013-08-03 16 views
0

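Convert a list of event times into a series of event counts per two-minute interval

Two closely related posts are here and here, but I could not adapt either of them to my exact situation.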

Here is a vector of times:

start.time = as.POSIXct("2013-06-20 01:00:00") 
x = start.time + runif(5, min = 0, max = 8*60) 
x = x[order(x)] 
x 
# [1] "2013-06-20 01:00:30 EDT" "2013-06-20 01:00:57 EDT" 
# [3] "2013-06-20 01:01:43 EDT" "2013-06-20 01:04:01 EDT" 
# [5] "2013-06-20 01:04:10 EDT" 

Next, here is a vector of two-minute markers:

y = seq(as.POSIXct("2013-06-20 01:00:00"), as.POSIXct("2013-06-20 01:06:00"), 60*2) 
y 
# [1] "2013-06-20 01:00:00 EDT" "2013-06-20 01:02:00 EDT" 
# [3] "2013-06-20 01:04:00 EDT" "2013-06-20 01:06:00 EDT" 

I would like a fast, slick, scalable way to count the elements of x that fall into the two-minute bin to the right of each element of y, like this:

                    y count.x
1 2013-06-20 01:00:00       3
2 2013-06-20 01:02:00       0
3 2013-06-20 01:04:00       2
4 2013-06-20 01:06:00       0

Answers

3

How about:

as.data.frame(table(cut(x, breaks=c(y, Inf)))) 

                 Var1 Freq
1 2013-06-20 01:00:00    3
2 2013-06-20 01:02:00    0
3 2013-06-20 01:04:00    2
4 2013-06-20 01:06:00    0
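
For comparison, here is a minimal sketch of another base R route to the same table, using findInterval() and tabulate() instead of cut(). This is not from either answer. The five timestamps are typed in by hand from the output printed in the question (the originals were random draws), and the UTC time zone and the name res are assumptions made for illustration.

```r
# Five event times copied from the question's printed output (offsets in seconds)
start.time = as.POSIXct("2013-06-20 01:00:00", tz = "UTC")
x = start.time + c(30, 57, 103, 241, 250)

# Two-minute markers: 01:00, 01:02, 01:04, 01:06
y = seq(start.time, by = 60 * 2, length.out = 4)

# For each x, find the index of the last marker that is <= x
idx = findInterval(as.numeric(x), as.numeric(y))

# Count how many events fell on each index; markers with no events get 0
res = data.frame(y = y, count.x = tabulate(idx, nbins = length(y)))
res$count.x
# [1] 3 0 2 0
```

Events earlier than the first marker get index 0 and are dropped by tabulate(), while events after the last marker are counted in the last bin, just as with breaks = c(y, Inf).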

Whoa, that works! – zkurtz

0

Here is a function that solves the problem and runs much faster than table(cut(...)):

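get.bin.counts = function(x, name.x = "x", start.pt, end.pt, bin.width){ 
    # Break points every bin.width seconds, from start.pt up to at most end.pt
    br.pts = seq(start.pt, end.pt, bin.width) 
    # Keep only values covered by the breaks; hist() stops with an error otherwise
    x = x[(x >= start.pt) & (x <= br.pts[length(br.pts)])] 
    # right = FALSE gives left-closed bins [start, end), as cut() does for date-times
    counts = hist(x, breaks = br.pts, plot = FALSE, right = FALSE)$counts 
    # Label each bin by its left edge
    dfm = data.frame(br.pts[-length(br.pts)], counts) 
    names(dfm) = c(name.x, "freq") 
    return(dfm) 
} 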

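The key line here is the one in the middle: counts = hist(.... Setting the plot option of the hist function to FALSE is essential, since otherwise hist also draws the histogram.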
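
To see what that key line returns, here is a small sketch that applies hist() directly to the five timestamps from the question. The timestamps are typed in by hand from the question's printed output, and the UTC time zone, the fifth break point at 01:08, and right = FALSE (so that each bin includes its left edge) are choices made for this illustration.

```r
# Five event times copied from the question's printed output (offsets in seconds)
start.time = as.POSIXct("2013-06-20 01:00:00", tz = "UTC")
x = start.time + c(30, 57, 103, 241, 250)

# Five break points (01:00 to 01:08) define four two-minute bins
br.pts = seq(start.time, by = 60 * 2, length.out = 5)

# plot = FALSE returns the histogram object without drawing anything
counts = hist(x, breaks = br.pts, plot = FALSE, right = FALSE)$counts
counts
# [1] 3 0 2 0
```

Unlike the open-ended breaks = c(y, Inf), hist() needs the breaks to span every value of x, which is why a closing break point is added here.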

To test the speed of this function, I ran the following:

# First define x, a large vector of times:  
start.time = as.POSIXct("2012-11-01 00:00:00") 
x = start.time + runif(50000, min = 0, max = 365*24*3600) 
x = x[order(x)] 
# Apply the function, keeping track of running time: 
t1 = Sys.time() 
dfm = get.bin.counts(x, name.x = "time", 
        start.pt = as.POSIXct("2012-11-01 00:00:00"), 
        end.pt = as.POSIXct("2013-07-01 00:00:00"), 
        bin.width = 60) 
as.numeric(difftime(Sys.time(), t1, units = "secs")) # prints elapsed time in seconds

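With this example, my function ran faster than table(cut(...)) by a factor of a bit more than 10. Credit goes to the cut help page, which states: "Instead of table(cut(x, br)), hist(x, br, plot = FALSE) is more efficient and less memory hungry."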
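
A rough way to reproduce the comparison is to time both approaches on the same data with system.time(). The sketch below is an assumption-laden variant of the test above, not the original benchmark: it uses a fixed seed, the UTC time zone, and a 30-day window with one-minute bins. Timings depend on the machine and R version, so no particular ratio is claimed.

```r
set.seed(1)  # fixed seed so the run is repeatable (not in the original test)

# 50000 random event times spread over 30 days
start.time = as.POSIXct("2012-11-01 00:00:00", tz = "UTC")
x = sort(start.time + runif(50000, min = 0, max = 30 * 24 * 3600))

# One-minute breaks covering the whole 30-day window
br = seq(start.time, by = 60, length.out = 30 * 24 * 60 + 1)

# Time both approaches on identical inputs, using left-closed bins in each
t.cut  = system.time(counts.cut  <- as.vector(table(cut(x, breaks = br))))
t.hist = system.time(counts.hist <- hist(x, breaks = br, plot = FALSE,
                                         right = FALSE)$counts)

# Both should account for every event
stopifnot(sum(counts.cut) == length(x), sum(counts.hist) == length(x))

# Number of bins where the two disagree. Expected to be 0, although hist()
# applies a tiny tolerance to its breaks, so a value within rounding distance
# of a break could in principle land in a neighbouring bin.
sum(counts.cut != counts.hist)

# Elapsed seconds for each approach
c(cut = t.cut[["elapsed"]], hist = t.hist[["elapsed"]])
```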