2017-09-25 70 views
5

I have a sample dataset of bike trips. My goal is to figure out the average amount of time that elapses between visits to station B.

So far, I have been able to simply order the dataset:

test[order(test$starttime, decreasing = FALSE),] 

and find the rows where start_station or end_station equals B:

which(test$start_station == 'B') 
which(test$end_station == 'B') 
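A minimal sketch of these two steps, run on a hypothetical three-row subset (the `trips` data frame and the `"UTC"` time zone are assumptions made for illustration, not part of the original data). Note that the result of `order()` has to be assigned back, otherwise the sorted copy is only printed:

```r
# Hypothetical three-row subset of the bike data, deliberately out of order
trips <- data.frame(
  start_station = c("B", "A", "C"),
  starttime = as.POSIXct(c("2017-09-25 07:30:00",
                           "2017-09-25 01:00:00",
                           "2017-09-25 15:30:00"), tz = "UTC"),
  end_station = c("C", "B", "B"),
  stringsAsFactors = FALSE
)

# Sort by start time and keep the result
trips <- trips[order(trips$starttime), ]
rownames(trips) <- NULL  # renumber rows so indices match the sorted order

# Row indices where the bike leaves B and where it arrives at B
leaves_b  <- which(trips$start_station == "B")
arrives_b <- which(trips$end_station == "B")
```

The two index vectors are then the raw material for pairing each departure from B with the next arrival at B.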

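The next part, the row indexing, is where I am having trouble. To calculate the time that elapses while the bike is away from station B, we need the difftime() between a record where start_station = "B" (the bike leaves) and the next record where end_station = "B", even if that record happens to be the same row (see row 6).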

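With the dataset below, we know that the bike spent 510 minutes away from station B between 7:30:00 and 16:00:00, 30 minutes away from station B between 18:00:00 and 18:30:00, and 210 minutes away from station B between 19:00:00 and 22:30:00, which averages to 250 minutes.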
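The arithmetic can be checked by hand with difftime(). This is only a sketch with the three departure and return times typed in manually; the `"UTC"` time zone is an assumption (the original data uses the local time zone), and it does not yet solve the pairing problem:

```r
tz <- "UTC"  # assumed time zone, chosen only to make the example reproducible

# Times the bike left station B and the next time it came back
left     <- as.POSIXct(c("2017-09-25 07:30:00",
                         "2017-09-25 18:00:00",
                         "2017-09-25 19:00:00"), tz = tz)
returned <- as.POSIXct(c("2017-09-25 16:00:00",
                         "2017-09-25 18:30:00",
                         "2017-09-25 22:30:00"), tz = tz)

# Element-wise differences in minutes
away      <- difftime(returned, left, units = "mins")
away_mins <- as.numeric(away)

# Average time away from B: (510 + 30 + 210) / 3
mean_away <- mean(away_mins)
```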

How can I reproduce this output in R using difftime()?

> test
   bikeid start_station           starttime end_station             endtime
1       1             A 2017-09-25 01:00:00           B 2017-09-25 01:30:00
2       1             B 2017-09-25 07:30:00           C 2017-09-25 08:00:00
3       1             C 2017-09-25 10:00:00           A 2017-09-25 10:30:00
4       1             A 2017-09-25 13:00:00           C 2017-09-25 13:30:00
5       1             C 2017-09-25 15:30:00           B 2017-09-25 16:00:00
6       1             B 2017-09-25 18:00:00           B 2017-09-25 18:30:00
7       1             B 2017-09-25 19:00:00           A 2017-09-25 19:30:00
8       1             A 2017-09-25 20:00:00           C 2017-09-25 20:30:00
9       1             C 2017-09-25 22:00:00           B 2017-09-25 22:30:00
10      1             B 2017-09-25 23:00:00           C 2017-09-25 23:30:00

Here is the sample data:

> dput(test) 
structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), start_station = c("A", 
"B", "C", "A", "C", "B", "B", "A", "C", "B"), starttime = structure(c(1506315600, 
1506339000, 1506348000, 1506358800, 1506367800, 1506376800, 1506380400, 
1506384000, 1506391200, 1506394800), class = c("POSIXct", "POSIXt" 
), tzone = ""), end_station = c("B", "C", "A", "C", "B", "B", 
"A", "C", "B", "C"), endtime = structure(c(1506317400, 1506340800, 
1506349800, 1506360600, 1506369600, 1506378600, 1506382200, 1506385800, 
1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("bikeid", 
"start_station", "starttime", "end_station", "endtime"), row.names = c(NA, 
-10L), class = "data.frame") 
+2

A first step would be to convert to long format, like 'library(data.table); mtest = melt(setDT(test), id = "bikeid", meas = patterns("_station", "time"), variable.name = "event", value.name = c("station", "time")); mtest[.(factor(1:2), c("start", "end")), on = .(event), event := i.V2]; setkey(mtest, bikeid, time)', but I am not sure of the best way after that. – Frank

Answers

1

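This calculates the differences as requested, in the order in which they occur, but it does not append them to the data.frame (df1 here appears to be the question's test data):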

lapply(df1$starttime[df1$start_station == "B"], function(x, et) difftime(et[x < et][1], x, units = "mins"), et = df1$endtime[df1$end_station == "B"]) 

[[1]] 
Time difference of 510 mins 

[[2]] 
Time difference of 30 mins 

[[3]] 
Time difference of 210 mins 

[[4]] 
Time difference of NA mins 

To calculate the mean time:

v1 <- sapply(df1$starttime[df1$start_station == "B"], function(x, et) difftime(et[x < et][1], x, units = "mins"), et = df1$endtime[df1$end_station == "B"]) 
mean(v1, na.rm = TRUE) 

[1] 250 
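If the durations should also live in the data.frame, one possible sketch is shown below. It is not part of the answer above: the `trips` data frame, the `hm()` helper, the `mins_away` column and the `"UTC"` time zone are all hypothetical names and assumptions chosen for illustration. It applies the same "first arrival at B after this departure" rule, row by row:

```r
tz <- "UTC"  # assumed time zone, for reproducibility only
hm <- function(x) as.POSIXct(paste("2017-09-25", x), tz = tz)  # hypothetical helper

# The ten sample trips, retyped by hand
trips <- data.frame(
  start_station = c("A", "B", "C", "A", "C", "B", "B", "A", "C", "B"),
  starttime = hm(c("01:00:00", "07:30:00", "10:00:00", "13:00:00", "15:30:00",
                   "18:00:00", "19:00:00", "20:00:00", "22:00:00", "23:00:00")),
  end_station = c("B", "C", "A", "C", "B", "B", "A", "C", "B", "C"),
  endtime = hm(c("01:30:00", "08:00:00", "10:30:00", "13:30:00", "16:00:00",
                 "18:30:00", "19:30:00", "20:30:00", "22:30:00", "23:30:00")),
  stringsAsFactors = FALSE
)

# All arrival times at B, in chronological order
returns <- trips$endtime[trips$end_station == "B"]

# New column: minutes until the bike is next back at B (NA for other rows)
trips$mins_away <- NA_real_
for (i in which(trips$start_station == "B")) {
  nxt <- returns[returns > trips$starttime[i]][1]  # NA if the bike never returns
  trips$mins_away[i] <- as.numeric(difftime(nxt, trips$starttime[i], units = "mins"))
}

avg_away <- mean(trips$mins_away, na.rm = TRUE)
```

Rows that do not start at B, and the last departure with no later return, keep NA in `mins_away`, so `na.rm = TRUE` is still needed for the mean.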
+0

Thanks, this approach works. Could you briefly explain how 'function(x, et)' works? –

+0

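'lapply' allows multiple arguments to be passed to the function. 'x' takes each 'starttime' value in turn, while 'et' is an additional argument defined after the function. This way the argument is defined only once but can be used twice inside the function. – manotheshark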
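To illustrate how the extra argument is passed, here is a small sketch on plain numbers. The function name `first_above` and the vectors `starts` and `ends` are hypothetical; only the calling pattern mirrors the answer:

```r
# Hypothetical helper: first candidate strictly greater than x (NA if none)
first_above <- function(x, candidates) candidates[candidates > x][1]

starts <- c(1, 5, 9)
ends   <- c(3, 4, 8)

# sapply() feeds each element of `starts` in as `x`;
# `candidates = ends` is forwarded unchanged on every call
res <- sapply(starts, first_above, candidates = ends)
```

Any named argument placed after the function in lapply() or sapply() is handed to every call, which is why the answer can define the vector of end times once and reuse it for each start time.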

1

Another possibility:

library(data.table) 
d <- setDT(test)[ , { 
    start = starttime[start_station == "B"] 
    end = endtime[end_station == "B"] 
    .(start = start, end = end, duration = difftime(end, start, units = "min")) 
} 
, by = .(trip = cumsum(start_station == "B"))] 
d 
#    trip               start                 end duration
# 1:    0                <NA> 2017-09-25 01:30:00  NA mins
# 2:    1 2017-09-25 07:30:00 2017-09-25 16:00:00 510 mins
# 3:    2 2017-09-25 18:00:00 2017-09-25 18:30:00  30 mins
# 4:    3 2017-09-25 19:00:00 2017-09-25 22:30:00 210 mins
# 5:    4 2017-09-25 23:00:00                <NA>  NA mins


d[ , mean(duration, na.rm = TRUE)] 
# Time difference of 250 mins 

# or 
d[ , mean(as.integer(duration), na.rm = TRUE)] 
# [1] 250 

The data is grouped by a counter that increases by 1 each time the bike starts a trip from "B" (by = cumsum(start_station == "B")).
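The grouping counter can be inspected on its own in base R. This sketch retypes the start_station column of the sample data by hand; the variable names `trip` and `groups` are chosen here for illustration:

```r
# start_station column of the ten sample rows
start_station <- c("A", "B", "C", "A", "C", "B", "B", "A", "C", "B")

# TRUE counts as 1, so the running sum steps up at every departure from B
trip <- cumsum(start_station == "B")

# Row numbers belonging to each trip group
groups <- split(seq_along(trip), trip)
```

Group 0 holds the rows before the first departure from B, which is why the first line of the result has no start time.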
