2015-10-09 71 views
0

Unlooping R code for computing overlapping time intervals

I have data in this format (excerpt):

 SW_Release deviceType  configStartDate  configEndDate 
1: 04.05.00   21 2005-11-03 19:12:36 2006-02-28 10:19:27 
2: 04.05.00   16 2005-11-04 03:59:05 2006-02-28 10:19:27 
3: 04.05.00   20 2005-11-04 03:59:06 2006-02-28 10:19:27 
4: 04.05.00   15 2005-11-04 03:59:06 2006-02-28 10:19:27 
5: 04.05.00   19 2005-11-04 03:59:06 2006-02-28 10:19:27 
6: 04.05.00   17 2005-11-04 03:59:06 2006-02-28 10:19:27 
7: 04.07.03   16 2006-02-28 10:19:27 2006-03-29 01:00:39 
8: 04.07.03   20 2006-02-28 10:19:27 2006-03-29 01:00:41 
9: 04.07.01   15 2006-02-28 10:19:27 2006-03-29 01:00:41 
10: 04.07.01   19 2006-02-28 10:19:27 2006-03-29 01:00:41 
11: 04.07.01   17 2006-02-28 10:19:27 2006-03-29 01:00:42 
12: 04.07.01   21 2006-02-28 10:19:27 2006-03-29 01:00:42 
13: 04.07.01   18 2006-02-28 10:19:27 2006-03-29 01:00:42 
14: 04.07.04   16 2006-03-29 01:00:40 2006-05-01 16:07:49 
15: 04.07.04   20 2006-03-29 01:00:41 2006-05-01 16:07:50 
16: 04.07.02   15 2006-03-29 01:00:41 2006-05-01 16:07:50 
17: 04.07.02   19 2006-03-29 01:00:41 2006-05-01 16:07:51 
18: 04.07.02   17 2006-03-29 01:00:42 2006-05-01 16:07:51 
19: 04.07.02   21 2006-03-29 01:00:42 2006-05-01 16:07:51 
20: 04.07.02   18 2006-03-29 01:00:42 2006-06-01 09:45:36 
21: 04.07.04   16 2006-05-02 09:47:57 2006-06-01 09:45:25 
22: 04.07.04   20 2006-05-02 09:47:57 2006-06-01 09:45:28 
23: 04.07.02   15 2006-05-02 09:47:58 2006-06-01 09:45:31 
24: 04.07.02   19 2006-05-02 09:47:58 2006-06-01 09:45:32 
25: 04.07.02   17 2006-05-02 09:47:58 2006-06-01 09:45:34 
26: 04.07.02   21 2006-05-02 09:47:58 2006-06-01 09:45:35 
27: 04.07.05   16 2006-06-01 09:45:27 2006-08-14 17:54:15 
28: 04.07.05   20 2006-06-01 09:45:29 2006-08-14 17:54:15 
29: 04.07.06   15 2006-06-01 09:45:31 2007-12-12 11:03:00 
30: 04.07.06   19 2006-06-01 09:45:33 2007-12-12 11:03:00 
31: 04.07.03   17 2006-06-01 09:45:35 2006-08-14 17:54:16 
32: 04.07.03   21 2006-06-01 09:45:35 2006-08-14 17:54:16 
33: 04.07.04   18 2006-06-01 09:45:37 2007-12-12 11:03:00 
34: 04.07.06   16 2006-08-14 17:54:15 2007-12-12 11:02:59 
35: 04.07.06   20 2006-08-14 17:54:15 2007-12-12 11:02:59 
36: 04.07.04   17 2006-08-14 17:54:16 2007-12-12 11:03:00 
37: 04.07.04   21 2006-08-14 17:54:16 2007-12-12 11:03:00 
38: 04.05.12   14 2011-06-17 15:40:13 2012-05-24 11:43:24 

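I need to sum all the intervals (the time between the second-to-last column and the last column), but as you can see, some rows have overlapping or partially overlapping intervals.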

Before I add up all the days, I need to convert the full data set (shown above) into something like this:

accumulated data: 
     configStartDate  configEndDate 
1: 2005-11-03 19:12:36 2007-12-12 11:03:00 
2: 2011-06-17 15:40:13 2012-05-24 11:43:24 
total days: 934.296 

Here is my R code that does this (it has to be R, although I am considering rewriting it in C++ and using Rcpp):

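library(data.table)

# Merge overlapping intervals by walking the rows in start order.
merge_intervals <- function(interval_dt) {
    interval_dt <- interval_dt[order(configStartDate), list(configStartDate, configEndDate)]

    # Start with the earliest interval
    new_dt <- interval_dt[1, list(configStartDate, configEndDate)]

    # seq_len(...)[-1] is empty for a one-row table, unlike 2:nrow(...)
    for (i in seq_len(nrow(interval_dt))[-1]) {
        buff <- interval_dt[i, list(configStartDate, configEndDate)]
        n_merged <- nrow(new_dt)

        if (new_dt[n_merged, configEndDate] >= buff[, configStartDate]) {
            # Overlap: extend the current interval only if buff ends later
            if (new_dt[n_merged, configEndDate] < buff[, configEndDate]) {
                new_dt[n_merged, configEndDate := buff[, configEndDate]]
            }
        } else {
            # No overlap: start a new interval
            new_dt <- rbind(new_dt, buff)
        }
    }

    return(new_dt)
}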

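Right now the whole thing (together with the other calculations) takes about 0.16 seconds to run, but for 3,000 unique assets that adds up to about 8 minutes of computation time.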

How can I convert the for loop into something faster to reduce the computation time? Thanks!

+0

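It should be possible to vectorize this. How do you want to handle overlapping intervals? Ignore the overlap, or merge the intervals into a new interval and consider only the new one? – Thierry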

+1

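Sorry, but your example doesn't make clear to me what you are trying to do. How do you get from the 10 rows shown in your first block (all in 2006) to the two rows in the second block (spanning 2005-2012)? Can you describe exactly how to get from the sample input to the expected output? – josliber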

+0

I edited the sample to include all rows to make it clearer. –

Answers

0

Something like this?

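library(dplyr)

now <- Sys.time()
df <- data.frame(
    id = 1:3,
    start = now + c(0, 1000, 3000),
    end = now + c(1500, 2000, 4000)
)

df %>%
    arrange(start) %>%
    mutate(
        # latest end time among all earlier rows (-Inf for the first row)
        prev_end = lag(cummax(as.numeric(end)), default = -Inf),
        # a new merged interval begins when a row starts after prev_end
        interval = cumsum(as.numeric(start) > prev_end)
    ) %>%
    group_by(interval) %>%
    summarise(start = min(start), end = max(end)) %>%
    mutate(delta = difftime(end, start, units = "days")) %>%
    summarise(total = sum(delta))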
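
For readers who want to stay in data.table, as in the question, the same running-maximum idea might be sketched as below. This is only a sketch under assumptions: the function name `merge_intervals_vec`, the `assetId` column, and the small example table are hypothetical and are not taken from the question's data. Touching intervals are merged, matching the `>=` test in the question's loop.

```r
library(data.table)

# Hypothetical helper: merge overlapping intervals without an explicit loop.
merge_intervals_vec <- function(interval_dt) {
    dt <- interval_dt[order(configStartDate), .(configStartDate, configEndDate)]

    # Latest end time among all earlier rows (-Inf for the first row).
    # cummax() is not defined for POSIXct, hence as.numeric().
    prev_end <- shift(cummax(as.numeric(dt$configEndDate)), fill = -Inf)

    # A row opens a new merged interval when it starts after prev_end.
    grp <- cumsum(as.numeric(dt$configStartDate) > prev_end)

    merged <- dt[, .(configStartDate = min(configStartDate),
                     configEndDate = max(configEndDate)),
                 by = .(grp = grp)]
    merged[, .(configStartDate, configEndDate)]
}

# Made-up example data; assetId is an assumed grouping column.
example_dt <- data.table(
    assetId = c("A", "A", "A", "B"),
    configStartDate = as.POSIXct(c("2006-01-01 00:00:00", "2006-01-05 00:00:00",
                                   "2006-02-01 00:00:00", "2006-01-01 00:00:00"),
                                 tz = "UTC"),
    configEndDate = as.POSIXct(c("2006-01-10 00:00:00", "2006-01-08 00:00:00",
                                 "2006-02-03 00:00:00", "2006-01-02 00:00:00"),
                               tz = "UTC")
)

# Merge per asset, then sum the interval lengths in days.
merged <- example_dt[, merge_intervals_vec(.SD), by = assetId]
merged[, days := as.numeric(difftime(configEndDate, configStartDate, units = "days"))]
totals <- merged[, .(total_days = sum(days)), by = assetId]
print(totals)
```

Running the merge once over all assets with `by = assetId` avoids both the per-row loop and a separate function call per asset from outside data.table, so it may reduce the overhead described in the question; actual timings would need to be measured on the real data.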