Faster way to count events that occur within 5-minute intervals?

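I have a matrix, events, which contains the times of occurrence of 5 million events. Each of these 5 million events has a "type" that ranges from 1 to 2000. A very simplified version of the matrix is below. The unit for "times" is seconds since 1970. All of the events occurred after 1/1/2012.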

>events 
     type   times 
     1   1352861760 
     1   1362377700 
     2   1365491820 
     2   1368216180 
     2   1362088800 
     2   1362377700

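I am trying to divide the time since 1/1/2012 into 5-minute buckets, and then fill each bucket with the number of events of each type i that occurred in it. My code is below. Note that types is a vector containing every possible type from 1 to 2000, and by is set to 300 because that is the number of seconds in 5 minutes.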

for(i in 1:length(types)){ 
    local <- events[events$type==types[i],c("type", "times")] 
    assign(sprintf("a%d", i),table(cut(local$times, breaks=seq(range(events$times)[1],range(events$times)[2], by=300)))) 
}

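This produces the variables a1 through a2000, each of which contains a row vector of how many occurrences of type i there are in each 5-minute bucket.

I then go on to find all pairwise correlations among a1 through a2000.

Is there a way to optimize the block of code I provided above? It runs very slowly, but I can't think of a faster way. Perhaps there are too many buckets and too little time.

Any insight would be much appreciated.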

Reproducible example:

>head(events) 
    type   times 
     12   1308575460 
     12   1308676680 
     12   1308825420 
     12   1309152660 
     12   1309879140 
     25   1309946460 

xevents <- xts(events[,"type"],.POSIXct(events[,"times"])) 
ep <- endpoints(xevents, "minutes", 5) 
counts <- period.apply(xevents, ep, tabulate, nbins=length(types)) 

>head(counts) 
         1 2 3 4 5 6 7 8 9 10 11 12 13 14 
2011-06-20 09:11:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-06-21 13:18:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-06-23 06:37:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-06-27 01:31:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-07-05 11:19:00 0 0 0 0 0 0 0 0 0 0 0 1 0 0 
2011-07-06 06:01:00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

> ep[1:20] 
[1] 0 1 2 3 4 5 6 7 8 9 10 12 20 21 22 23 24 25 26 27

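The above is the code I have been using, but the problem is that it does not step in 5-minute increments: it only advances when an actual event occurs.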


2013-07-24 user2588829

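Your "reproducible example" is not [reproducible](http://stackoverflow.com/q/5963269/271616), and you don't show your desired output, but I think you need an observation for every 5-minute interval, regardless of whether you actually have data in that period. –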

I would use the xts package for this. With the period.apply and endpoints functions it is easy to run a function over non-overlapping 5-minute periods.

# create sample data 
library(xts) 
set.seed(21) 
N <- 1e6 
events <- cbind(sample(2000, N, replace=TRUE), 
    as.POSIXct("2012-01-01")+sample(1e7,N)) 
colnames(events) <- c("type","times") 
# create xts object 
xevents <- xts(events[,"type"], .POSIXct(events[,"times"])) 
# find the last row of each non-overlapping 5-minute interval 
ep <- endpoints(xevents, "minutes", 5) 
# count the number of occurrences of each "type" 
counts <- period.apply(xevents, ep, tabulate, nbins=2000) 
# set colnames 
colnames(counts) <- paste0("a",1:ncol(counts)) 
# calculate correlation 
#cc <- cor(counts)
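
As a toy illustration of how endpoints behaves when some intervals contain no events, here is a minimal sketch. The four timestamps and their types are made up for this example; only the xts calls already used above are assumed.

```r
library(xts)

# Four made-up events on 2012-01-01 (UTC) at 00:01, 00:02, 00:07 and 00:21
secs <- 1325376000 + c(60, 120, 420, 1260)
x <- xts(c(1, 2, 1, 2), .POSIXct(secs, tz = "UTC"))

# endpoints() marks the last row of each 5-minute period that contains data
ep <- endpoints(x, "minutes", 5)

# One row of counts per period with data; periods without events get no row
counts <- period.apply(x, ep, tabulate, nbins = 2)
```

The events span five 5-minute buckets (00:00 to 00:25), yet counts has only three rows, because buckets without any event never produce an endpoint.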

Update in response to the OP's comments/edits:

# Create a sequence of 5-minute steps, from the actual start of the data 
m5 <- seq(round(start(xevents),'mins'), end(xevents), by='5 mins') 
# Alternatively (this overwrites the m5 above, so use one or the other):
# create a sequence of 5-minute steps, from the start of 2012-01-01 
m5 <- seq(as.POSIXct("2012-01-01"), end(xevents), by='5 mins') 
# merge xevents with empty 5-minute xts object, and 
# subtract 1 second, so endpoints are at end of each 5-minute interval 
xevents5 <- merge(xevents, xts(,m5-1)) 
ep5 <- endpoints(xevents5, "minutes", 5) 
counts5 <- period.apply(xevents5, ep5, tabulate, nbins=2000) 
colnames(counts5) <- paste0("a",1:ncol(counts5)) 
# align to the beginning of each 5-minute interval, if you want 
counts5 <- align.time(counts5,60*5)
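
Once a bucket-by-type count matrix exists, the pairwise correlations asked about in the question can be computed in a single cor call. The sketch below uses a small made-up matrix rather than the real counts. One detail worth watching is that a type with identical counts in every bucket has zero variance, which makes cor return NA for that column, so such columns are dropped first. For an xts object, passing coredata(counts5) instead of the toy matrix is an assumption about how one might adapt this.

```r
# Toy bucket-by-type count matrix (made-up values); rows are 5-minute buckets
counts <- cbind(a1 = c(0, 1, 0, 2, 1),
                a2 = c(0, 2, 0, 4, 2),
                a3 = c(0, 0, 0, 0, 0),   # a type that never occurs
                a4 = c(3, 0, 1, 0, 0))

# Keep only columns that vary; constant columns would give NA correlations
has_variance <- apply(counts, 2, function(col) length(unique(col)) > 1)

# All pairwise correlations between the remaining types
cc <- cor(counts[, has_variance, drop = FALSE])
```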


2013-07-24 21:46:56

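This code is awesome! I never knew about the xts library until now. However, the .POSIXct step converts my dates incorrectly, which leads to wrong counts... any idea how to fix this? – user2588829

@user2588829: If you were less vague, I might have an idea how to fix it... "converts my dates incorrectly" doesn't tell me anything. –

Well, converting it with the .POSIXct function (the exact function I'm using is: `as.POSIXct(strptime(x, format = "%m/%d/%y %H:%M:%S"), tz = "GMT"), origin = "1970-01-01")`) is turning what was originally 11/14/2012 02:56 into 1970-01-07 14:28:44. – user2588829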

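Use cut over the range of times, just as you did. After that, you can tabulate with table or xtabs, but do it on the whole data set at once so that the result is a single matrix. Something like the following: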

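# Round the time range down to whole 5-minute (300-second) boundaries 
r <- trunc(range(events$times)/300) * 300 
# right=FALSE gives [start, end) buckets, so an event exactly on the first boundary is kept 
events$times.bin <- cut(events$times, seq(r[1], r[2] + 300, by=300), right=FALSE) 
xtabs(~type+times.bin, events, drop.unused.levels=TRUE)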

Decide whether or not you want drop.unused.levels. With this kind of data, you may also want to create a sparse matrix.
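
One possible way to build such a sparse matrix is sketched below. It swaps the xtabs call for sparseMatrix from the Matrix package and computes the bucket index with integer division instead of cut. The data are the six rows from the question, and the names origin, bin and n_bins are illustrative only.

```r
library(Matrix)

# The six example rows from the question
events <- data.frame(
  type  = c(1, 1, 2, 2, 2, 2),
  times = c(1352861760, 1362377700, 1365491820,
            1368216180, 1362088800, 1362377700)
)

# Integer bucket index, counted from the 300-second boundary at or below
# the earliest event (bucket 1 is the first 5-minute interval)
origin <- floor(min(events$times) / 300) * 300
bin    <- (events$times - origin) %/% 300 + 1
n_bins <- max(bin)

# Rows are buckets, columns are types; repeated (bucket, type) pairs are summed
counts <- sparseMatrix(i = bin, j = events$type,
                       x = rep(1, length(bin)),
                       dims = c(n_bins, max(events$type)))
```

Here buckets are placed in rows and types in columns, which is the orientation cor expects when correlating types. Only the non-zero cells are stored, so the thousands of empty buckets cost almost nothing.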


2013-07-24 21:29:08 krlmlr

Did you try running this on 5 million rows? I ask because my computer locked up when I tried to run it on 1 million... –

@JoshuaUlrich: No, I didn't. Did you use `sparse=T`? – krlmlr

With 5 million records, I would probably use data.table. You could do it like this:

# df is the events data.frame (columns: type, times) 
# First we make a sequence of the buckets from initial time to at least the end time + 5 minutes 
buckets <- seq(from = min(df$times) , by = 300 , to = max(df$times)+300) 

require(data.table) 
DT <- data.table(df) 

# Work out what bucket each time is in (index of the first bucket boundary >= time) 
# and store it as a new column with := 
DT[ , bucket := which.max(times <= buckets) , by = "times" ] 

# Aggregate events by type and time bucket (.N is the number of rows in each group) 
DT[ , list(Count = .N) , by = list(type, bucket) ] 
   type bucket Count 
1:    1      1     1 
2:    1  31721     1 
3:    2  42102     1 
4:    2  51183     1 
5:    2  30758     1 
6:    2  31721     1
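
The which.max step runs once per distinct timestamp, which may become slow on millions of rows. A vectorised alternative, sketched here as a substitute technique rather than as part of the answer, derives the bucket directly with integer division. The bucket numbers can differ by one from the which.max version, because this sketch rounds down to the start of each interval instead of picking the first boundary at or after the event.

```r
library(data.table)

# The six example rows from the question
DT <- data.table(
  type  = c(1, 1, 2, 2, 2, 2),
  times = c(1352861760, 1362377700, 1365491820,
            1368216180, 1362088800, 1362377700)
)

# Vectorised bucket index: whole 5-minute steps since the earliest event, plus 1
DT[, bucket := (times - min(times)) %/% 300 + 1]

# Count rows per (type, bucket) pair
counts <- DT[, .N, by = .(type, bucket)]
```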


2013-07-24 21:59:38
