2016-09-28 205 views

Check for overlapping time intervals using start and end times

I have this data frame, sorted by EndTime:

df = data.frame(ID= c(1,1,1,1,1,1,1), NumberInSequence= c(1,2,3,4,5,6,7), 
       StartTime = as.POSIXct(c("2016-01-15 18:02:11 GMT","2016-01-15 18:10:33 GMT","2016-01-15 18:25:08 GMT", 
               "2016-01-15 18:33:56 GMT","2016-01-15 18:21:03 GMT","2016-01-15 19:55:09 GMT","2016-01-15 19:57:03 GMT")) , 
         EndTime = as.POSIXct(c("2016-01-15 18:02:17 GMT","2016-01-15 18:10:39 GMT","2016-01-15 18:25:14 GMT", 
               "2016-01-15 18:34:02 GMT","2016-01-15 19:53:17 GMT","2016-01-15 19:56:15 GMT","2016-01-15 19:58:17 GMT")) 
         ) 

Each row is a time interval with a start time and an end time.

df 

ID NumberInSequence   StartTime    EndTime 
1 1    1 2016-01-15 18:02:11 2016-01-15 18:02:17 
2 1    2 2016-01-15 18:10:33 2016-01-15 18:10:39 
3 1    3 2016-01-15 18:25:08 2016-01-15 18:25:14 
4 1    4 2016-01-15 18:33:56 2016-01-15 18:34:02 
5 1    5 2016-01-15 18:21:03 2016-01-15 19:53:17 
6 1    6 2016-01-15 19:55:09 2016-01-15 19:56:15 
7 1    7 2016-01-15 19:57:03 2016-01-15 19:58:17 

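I then use dplyr to add two fields: NextStartTime, the start time of the next interval, and WaitTime, the difference between NextStartTime and EndTime. The WaitTime column works in most cases, except when there are overlapping intervals.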

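library(dplyr)

df %>% group_by(ID) %>% 
     mutate(
     # Start time of the next row, kept only when the next row is the next number in the sequence 
     NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)] , 
     # Gap in seconds between this interval's end and the next interval's start 
     WaitTime = difftime(NextStartTime,EndTime, units = 's') 
     #max_s = max(StartTime) #, 
    # cum_max_s = as.POSIXct(cummin(as.numeric(StartTime)),origin="1970-01-01") 
    ) 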


    ID NumberInSequence   StartTime    EndTime  NextStartTime WaitTime 
1 1    1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33 496 secs 
2 1    2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08 869 secs 
3 1    3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56 522 secs 
4 1    4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs 
5 1    5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09 112 secs 
6 1    6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03 48 secs 
7 1    7 2016-01-15 19:57:03 2016-01-15 19:58:17    <NA> NA secs 

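Now I need to add a column called "FLAG" with the value OK or NOT OK, where:

"OK" means the interval is neither entirely nor partially inside another interval. So an "OK" interval does not overlap with any other interval.

"NOT OK" means the interval IS partially or entirely inside another interval. So a "NOT OK" interval overlaps with another interval.

Below are the intervals, with the expected FLAG value and a short description for each: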

StartTime    EndTime    FLAG 
2016-01-15 18:02:11 2016-01-15 18:02:17  OK - this interval does not overlap with other intervals 
2016-01-15 18:10:33 2016-01-15 18:10:39  OK - this interval does not overlap with other intervals 
2016-01-15 18:25:08 2016-01-15 18:25:14  NOT OK - this interval is within the 18:21:03 start time interval 
2016-01-15 18:33:56 2016-01-15 18:34:02  NOT OK - this interval is within the 18:21:03 start time interval 
2016-01-15 18:21:03 2016-01-15 19:53:17  NOT OK - this interval contains other intervals 
2016-01-15 19:55:09 2016-01-15 19:56:15  OK - this interval does not overlap with other intervals 
2016-01-15 19:57:03 2016-01-15 19:58:17  OK - this interval does not overlap with other intervals 
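
The OK / NOT OK rule above can be written as a pairwise comparison. The following is only a minimal base R sketch, not part of the original post. It assumes a single ID, that the times are in GMT, and that intervals which merely touch at an endpoint do not count as overlapping; the names s, e, overlaps and flag are made up for illustration.

```r
# Copy of the question's intervals (times assumed to be GMT).
start <- as.POSIXct(c("2016-01-15 18:02:11", "2016-01-15 18:10:33",
                      "2016-01-15 18:25:08", "2016-01-15 18:33:56",
                      "2016-01-15 18:21:03", "2016-01-15 19:55:09",
                      "2016-01-15 19:57:03"), tz = "GMT")
end   <- as.POSIXct(c("2016-01-15 18:02:17", "2016-01-15 18:10:39",
                      "2016-01-15 18:25:14", "2016-01-15 18:34:02",
                      "2016-01-15 19:53:17", "2016-01-15 19:56:15",
                      "2016-01-15 19:58:17"), tz = "GMT")

s <- as.numeric(start)
e <- as.numeric(end)

# overlaps[i, j] is TRUE when interval i starts before interval j ends
# AND interval i ends after interval j starts (strict comparison, so
# intervals that only touch at an endpoint are not flagged).
overlaps <- outer(s, e, "<") & outer(e, s, ">")

# An interval always overlaps itself, so ignore the diagonal.
diag(overlaps) <- FALSE

# NOT OK if the interval overlaps at least one other interval.
flag <- ifelse(rowSums(overlaps) > 0, "NOT OK", "OK")
print(data.frame(StartTime = start, EndTime = end, FLAG = flag))
```

This compares every pair of intervals, so it is quadratic in the number of rows; it is meant to pin down the definition rather than to scale to large data.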

I have been looking at using cummin or cummax in dplyr, maybe something like this:

cum_max_s = as.POSIXct(cummin(as.numeric(StartTime)),origin="1970-01-01") 

Answer

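Here is my attempt. I think foverlaps() in the data.table package is our friend in this case. You can find some examples on SO, and you may want to go through them to understand the function. You need a dummy data.table that contains the start and end times. In your case, you already have them, so I created dummy with the minimum information. Then you use setkey() and make use of foverlaps().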

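library(data.table)

# df2 is assumed to be the result of the dplyr step in the question 
# (df plus the NextStartTime and WaitTime columns). 

# Create a dummy dt for foverlaps(). 
dummy <- setDT(df2)[, 1:4, with = FALSE] 

# Use foverlaps(). 
setkey(setDT(df2), StartTime, EndTime) 
foo <- foverlaps(dummy, setDT(df2), by.x = c("StartTime", "EndTime")) 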
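
To see what foverlaps() returns before any clean-up, here is a small self-join on made-up integer intervals. This is a sketch under assumptions, not the answerer's code: the table x and its columns start and end are hypothetical, and it relies on the default type = "any", which treats intervals as closed (touching endpoints count as overlapping).

```r
library(data.table)

# Three toy intervals: [5, 6] lies inside [1, 10]; [20, 25] stands alone.
x <- data.table(start = c(1L, 5L, 20L), end = c(10L, 6L, 25L))
setkey(x, start, end)

# Self-join: every interval is matched against every interval it overlaps,
# including itself. Columns coming from the first argument get an "i." prefix.
res <- foverlaps(x, x, type = "any")
print(res)

# An interval that only matches itself appears once; more than one match
# means it overlaps some other interval.
counts <- res[, .(n = .N), by = .(i.start, i.end)]
counts[, flag := ifelse(n > 1, "NOT OK", "OK")]
print(counts)
```

Because each interval always matches itself, the clean-up has to tell self-matches apart from real overlaps.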

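Now it is time to clean up the data. For each NumberInSequence, if there is more than one overlapping interval (n > 1), remove the rows with identical start and end times (StartTime == i.StartTime & EndTime == i.EndTime), which are the interval matched against itself. Then remove the duplicated rows for each NumberInSequence. If you have just one row showing an overlap with another interval, that is enough, right? Finally, if StartTime == i.StartTime & EndTime == i.EndTime is TRUE, no other interval overlaps with that interval, so you say OK. Otherwise, NOT OK. Remove the extra columns afterwards if necessary.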

foo[,.SD[!(StartTime == i.StartTime & EndTime == i.EndTime & .N > 1)], 
     by = c("ID","NumberInSequence")][!duplicated(NumberInSequence)][, 
      check := ifelse(StartTime == i.StartTime & EndTime == i.EndTime, 
          "OK", "NOT OK")] -> out  
print(out) 

# ID NumberInSequence   StartTime    EndTime  NextStartTime WaitTime i.ID i.NumberInSequence 
#1: 1    1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33 496 secs 1     1 
#2: 1    2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08 869 secs 1     2 
#3: 1    5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09 112 secs 1     3 
#4: 1    3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56 522 secs 1     5 
#5: 1    4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs 1     5 
#6: 1    6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03 48 secs 1     6 
#7: 1    7 2016-01-15 19:57:03 2016-01-15 19:58:17    <NA> NA secs 1     7 

#   i.StartTime   i.EndTime check 
#1: 2016-01-15 18:02:11 2016-01-15 18:02:17  OK 
#2: 2016-01-15 18:10:33 2016-01-15 18:10:39  OK 
#3: 2016-01-15 18:25:08 2016-01-15 18:25:14 NOT OK 
#4: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK 
#5: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK 
#6: 2016-01-15 19:55:09 2016-01-15 19:56:15  OK 
#7: 2016-01-15 19:57:03 2016-01-15 19:58:17  OK 
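
The cummax idea from the question can also be made to work in dplyr alone. The following is a minimal sketch, not the answerer's method. It assumes times in GMT and that intervals touching at an endpoint do not overlap; the names toy, PrevMaxEnd, NextStart, OverlapsEarlier and OverlapsLater are made up for illustration. After sorting by StartTime within each ID, an interval overlaps an earlier one if it starts before the running maximum of the earlier end times, and it overlaps a later one if the next interval starts before it ends.

```r
library(dplyr)

# Copy of the question's data (times assumed to be GMT).
toy <- data.frame(
  ID = 1,
  NumberInSequence = 1:7,
  StartTime = as.POSIXct(c("2016-01-15 18:02:11", "2016-01-15 18:10:33",
                           "2016-01-15 18:25:08", "2016-01-15 18:33:56",
                           "2016-01-15 18:21:03", "2016-01-15 19:55:09",
                           "2016-01-15 19:57:03"), tz = "GMT"),
  EndTime = as.POSIXct(c("2016-01-15 18:02:17", "2016-01-15 18:10:39",
                         "2016-01-15 18:25:14", "2016-01-15 18:34:02",
                         "2016-01-15 19:53:17", "2016-01-15 19:56:15",
                         "2016-01-15 19:58:17"), tz = "GMT")
)

flagged <- toy %>%
  group_by(ID) %>%
  arrange(StartTime, .by_group = TRUE) %>%
  mutate(
    # Latest end time among all intervals that start earlier in the group.
    PrevMaxEnd = lag(cummax(as.numeric(EndTime))),
    # After sorting, the earliest later start is simply the next start.
    NextStart = lead(as.numeric(StartTime)),
    OverlapsEarlier = !is.na(PrevMaxEnd) & as.numeric(StartTime) < PrevMaxEnd,
    OverlapsLater = !is.na(NextStart) & NextStart < as.numeric(EndTime),
    FLAG = if_else(OverlapsEarlier | OverlapsLater, "NOT OK", "OK")
  ) %>%
  ungroup() %>%
  arrange(NumberInSequence) %>%
  select(ID, NumberInSequence, StartTime, EndTime, FLAG)

print(flagged)
```

On the question's data this flags sequence numbers 3, 4 and 5 as NOT OK and the rest as OK, matching the expected FLAG column, and it needs only one sort per ID rather than a join.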