2016-06-21

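How can I use R or SQL to calculate a difference for each group ID?

I am looking for a way to calculate time differences within each group ID. Here is part of my data, followed by its dput output: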

ID  road beginTime endTime  Mon Tue Wed Thu Fri Sat
666 757  9:00 AM   11:45 AM                     S
555 758  1:55 PM   3:45 PM  M       W
555 759  10:40 AM  12:30 PM M       W
555 760  4:00 PM   5:50 PM      Tue     R
444 761  3:00 PM   4:25 PM      Tue     R
444 762  4:30 PM   7:15 PM  M
444 763  12:50 PM  2:40 PM                  Fri
444 764  10:40 AM  11:35 AM     Tue     R
222 765  11:45 AM  2:30 PM  M       W
222 766  6:00 PM   9:40 PM              R
333 767  8:30 AM   11:15 AM M       W
333 768  8:30 AM   11:15 AM     Tue     R
333 769  1:25 PM   2:50 PM      Tue     R
333 770  11:45 AM  1:10 PM  M       W

structure(list(ID = c(666L, 555L, 555L, 555L, 444L, 444L, 444L, 
444L, 222L, 222L, 333L, 333L, 333L, 333L), road = 757:770, beginTime = structure(c(11L, 
2L, 3L, 7L, 6L, 8L, 5L, 3L, 4L, 9L, 10L, 10L, 1L, 4L), .Label = c("1:25 PM", 
"1:55 PM", "10:40 AM", "11:45 AM", "12:50 PM", "3:00 PM", "4:00 PM", 
"4:30 PM", "6:00 PM", "8:30 AM", "9:00 AM"), class = "factor"), 
    endTime = structure(c(4L, 9L, 5L, 11L, 10L, 12L, 7L, 3L, 
    6L, 13L, 2L, 2L, 8L, 1L), .Label = c("1:10 PM", "11:15 AM", 
    "11:35 AM", "11:45 AM", "12:30 PM", "2:30 PM", "2:40 PM", 
    "2:50 PM", "3:45 PM", "4:25 PM", "5:50 PM", "7:15 PM", "9:40 PM" 
    ), class = "factor"), Mon = structure(c(1L, 2L, 2L, 1L, 1L, 
    2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L), .Label = c("", "M"), class = "factor"), 
    Tue = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 
    1L, 2L, 2L, 1L), .Label = c("", "Tue"), class = "factor"), 
    Wed = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
    2L, 1L, 1L, 2L), .Label = c("", "W"), class = "factor"), 
    Thu = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 
    1L, 2L, 2L, 1L), .Label = c("", "R"), class = "factor"), 
    Fri = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = c("", "Fri"), class = "factor"), 
    Sat = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = c("", "S"), class = "factor")), .Names = c("ID", 
"road", "beginTime", "endTime", "Mon", "Tue", "Wed", "Thu", "Fri", 
"Sat"), class = "data.frame", row.names = c(NA, -14L)) 

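Each ID drives on different roads at different times of day (beginTime, endTime). I want to calculate the waiting (non-driving) time for each ID. For example, ID = 555 drives on Monday and Wednesday. Its first period runs from 10:40 AM to 12:30 PM. It then waits 1.41 hours (85 minutes) before starting another period from 1:55 PM to 3:45 PM. That 1.41-hour wait is what I need. When this ID drives on Tuesday and Thursday, there is another waiting time. ID = 666 drives only one period, on Saturday, so its waiting time is 0. The difficulty with my data is that each ID has different periods on each day. Any suggestions? Thanks very much!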


Can you 'dput' your data so we can test? – 989


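I suggest converting from wide (one column per weekday) to long format using the day of the week, or using a date instead if you have more than one week of data. Your fields would be 'ID', 'road', 'beginTime', 'endTime', 'date'. From there it is much easier to group by day/date with functions like 'aggregate' or 'dplyr::group_by', and then use lead or lag to find the time between rows within a group. – r2evans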


@r2evans, I have tried that approach before. But ID = 555, for example, drives two days a week; how can I put both days into a single "date" column? – user5843090

Answer


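Using the "long" format I mentioned in the comments makes things a bit easier.

First, I'll tidy up your data a little: convert the factors to strings, and then the time strings to times (df is your data from the dput output above):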

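library(dplyr)
# small helper function: parse clock strings such as "1:55 PM" into POSIXct
astime <- function(x) as.POSIXct(x, format = "%I:%M %p")
df2 <- df %>%
    # factors -> character (across() replaces the deprecated mutate_each()/funs(); needs dplyr >= 1.0)
    mutate(across(beginTime:Sat, as.character)) %>%
    # character -> POSIXct
    mutate(across(c(beginTime, endTime), astime))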
head(df2) 
#    ID road           beginTime             endTime Mon Tue Wed Thu Fri Sat
# 1 666  757 2016-06-21 09:00:00 2016-06-21 11:45:00                       S
# 2 555  758 2016-06-21 13:55:00 2016-06-21 15:45:00   M       W
# 3 555  759 2016-06-21 10:40:00 2016-06-21 12:30:00   M       W
# 4 555  760 2016-06-21 16:00:00 2016-06-21 17:50:00     Tue       R
# 5 444  761 2016-06-21 15:00:00 2016-06-21 16:25:00     Tue       R
# 6 444  762 2016-06-21 16:30:00 2016-06-21 19:15:00   M
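
If the placeholder date in these columns is distracting, one alternative (not part of this answer's approach) is to store each clock time as minutes after midnight. The sketch below uses a hypothetical helper named to_minutes and assumes a locale in which %p recognises "AM"/"PM":

```r
# Hypothetical helper (an assumption, not from the answer):
# parse strings such as "1:55 PM" into minutes after midnight.
to_minutes <- function(x) {
  t <- strptime(as.character(x), format = "%I:%M %p", tz = "UTC")
  t$hour * 60 + t$min
}

mins <- to_minutes(c("10:40 AM", "12:30 PM", "1:55 PM"))
mins
# 640 750 835

# Gap between the end of the first drive (12:30 PM) and the start of the next (1:55 PM)
mins[3] - mins[2]
# 85
```

Differences between such values are plain numbers of minutes, so no date component is involved.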

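(Don't worry that the dates are all wrong: as.POSIXct fills in the current date when it is given only a time, and the date should simply be ignored.) Now I convert from wide to long format, dropping the rows where the day is an empty string: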

library(tidyr) 
df3 <- df2 %>% 
    gather(day, ign, Mon:Sat) %>% 
    filter(ign != "") %>% 
    select(-ign) 
head(df3) 
#    ID road           beginTime             endTime day
# 1 555  758 2016-06-21 13:55:00 2016-06-21 15:45:00 Mon
# 2 555  759 2016-06-21 10:40:00 2016-06-21 12:30:00 Mon
# 3 444  762 2016-06-21 16:30:00 2016-06-21 19:15:00 Mon
# 4 222  765 2016-06-21 11:45:00 2016-06-21 14:30:00 Mon
# 5 333  767 2016-06-21 08:30:00 2016-06-21 11:15:00 Mon
# 6 333  770 2016-06-21 11:45:00 2016-06-21 13:10:00 Mon
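
The wide-to-long step is also what answers the "two days in one column" question from the comments: an ID that drives a road on two days simply gets two rows. As a rough sketch on toy data (two of the ID 555 rows, time columns omitted), the same reshaping can be written with tidyr's newer pivot_longer, assuming tidyr 1.0 or later; the column name flag is only illustrative:

```r
library(dplyr)
library(tidyr)

# Toy rows in the same wide layout as the question (two of the ID 555 rows).
wide <- data.frame(
  ID = c(555L, 555L), road = c(758L, 760L),
  Mon = c("M", ""), Tue = c("", "Tue"), Wed = c("W", ""), Thu = c("", "R"),
  stringsAsFactors = FALSE
)

long <- wide %>%
  pivot_longer(Mon:Thu, names_to = "day", values_to = "flag") %>%  # one row per road and day
  filter(flag != "") %>%                                           # keep only days actually driven
  select(-flag)

long
# four rows: road 758 on Mon and Wed, road 760 on Tue and Thu
```

Unlike gather, pivot_longer keeps the rows of each original record together, so the row order differs from the output above, but the set of rows is the same.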

Now I group the rows and compute the waiting times:

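df4 <- df3 %>%
    arrange(ID, day, beginTime) %>%
    group_by(ID, day) %>%
    mutate(
        # gap between this drive's start and the previous drive's end;
        # the first drive in each group is compared with its own start, giving 0
        waitTime = difftime(beginTime, dplyr::lag(endTime, default = beginTime[1]), units = 'secs')
    )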
head(df4) 
# Source: local data frame [6 x 6]
# Groups: ID, day [5]
#      ID  road           beginTime             endTime   day       waitTime
#   <int> <int>              <time>              <time> <chr> <S3: difftime>
# 1   222   765 2016-06-21 11:45:00 2016-06-21 14:30:00   Mon         0 secs
# 2   222   766 2016-06-21 18:00:00 2016-06-21 21:40:00   Thu         0 secs
# 3   222   765 2016-06-21 11:45:00 2016-06-21 14:30:00   Wed         0 secs
# 4   333   767 2016-06-21 08:30:00 2016-06-21 11:15:00   Mon         0 secs
# 5   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Mon      1800 secs
# 6   333   768 2016-06-21 08:30:00 2016-06-21 11:15:00   Thu         0 secs

You can easily filter for the periods that were preceded by a wait:

df4 %>% 
    filter(waitTime > 0) 
# Source: local data frame [8 x 6]
# Groups: ID, day [8]
#      ID  road           beginTime             endTime   day       waitTime
#   <int> <int>              <time>              <time> <chr> <S3: difftime>
# 1   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Mon      1800 secs
# 2   333   769 2016-06-21 13:25:00 2016-06-21 14:50:00   Thu      7800 secs
# 3   333   769 2016-06-21 13:25:00 2016-06-21 14:50:00   Tue      7800 secs
# 4   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Wed      1800 secs
# 5   444   761 2016-06-21 15:00:00 2016-06-21 16:25:00   Thu     12300 secs
# 6   444   761 2016-06-21 15:00:00 2016-06-21 16:25:00   Tue     12300 secs
# 7   555   758 2016-06-21 13:55:00 2016-06-21 15:45:00   Mon      5100 secs
# 8   555   758 2016-06-21 13:55:00 2016-06-21 15:45:00   Wed      5100 secs

Here you can see that your example, ID 555 on Monday and Wednesday, has the 1.41-hour (5100 sec) break, and that ID 666 has no waiting time.
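
If one total per ID and day is wanted, including the zeros for IDs such as 666, the per-row waits can be summed instead of filtered. The following is a minimal self-contained sketch on toy long-format data; the names long, prevEnd, waitHours, and totalWaitHours are illustrative, and .groups = "drop" assumes dplyr 1.0 or later:

```r
library(dplyr)

# Same kind of helper as above, with a fixed time zone for reproducibility.
astime <- function(x) as.POSIXct(x, format = "%I:%M %p", tz = "UTC")

# Toy long-format data: ID 555 on Monday (two drives), ID 666 on Saturday (one drive).
long <- data.frame(
  ID = c(555L, 555L, 666L),
  day = c("Mon", "Mon", "Sat"),
  beginTime = astime(c("1:55 PM", "10:40 AM", "9:00 AM")),
  endTime = astime(c("3:45 PM", "12:30 PM", "11:45 AM"))
)

waits <- long %>%
  arrange(ID, day, beginTime) %>%
  group_by(ID, day) %>%
  mutate(
    # previous drive's end; the first drive of a group falls back to its own start
    prevEnd = lag(endTime, default = first(beginTime)),
    waitHours = as.numeric(difftime(beginTime, prevEnd, units = "hours"))
  ) %>%
  summarise(totalWaitHours = sum(waitHours), .groups = "drop")

waits
# ID 555, Mon: 85/60 hours (about 1.42); ID 666, Sat: 0
```

The 5100 seconds are 85 minutes, i.e. about 1.4167 hours, which the question reports truncated as 1.41.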