Aggregate function with times and groups

I am trying to build a stacked bar chart of flight time per year and per type. The head of my data frame mat looks like this:

head(mat) 

    year flights.type flights.duration 
1 2000   HR20   01:12:00 
2 2000   HR20   02:00:00 
3 2000   L4   00:54:00 
4 2000   L4   00:42:00 
5 2000   L4   00:22:00 
6 2000   HR20   00:24:00

I want to sum flights.duration by year and by type, and then build a stacked bar chart from the result.

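I tried using the aggregate function, but it does not work properly with times. Can anyone help me? My attempt at aggregating by year looks like this: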

aggregate(mat$flights.duration,format(.POSIXct(mat$flights.duration,tz="GMT"), "%H:%M:%S"),FUN=sum, by=list(mat$year))

Source

2016-05-15 richpiana

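One of your problems is that you are not correctly breaking "01:12:00" into its time components. Two approaches I have used are: supply a date and use the POSIX functions to take the difference from midnight (if all durations are under 24 hours), or pull the variable apart and perform the calculation yourself. A time series package might have a cleaner method. – lmo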
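The first approach in the comment above (attach a date, then take the difference from midnight) might look roughly like the following base R sketch. The data frame is rebuilt from head(mat) in the question; the dummy date 1970-01-01 and the helper column name minutes are arbitrary choices made for illustration, not something from the original post.

```r
# Sample data mirroring head(mat) from the question
mat <- data.frame(
  year = rep(2000L, 6),
  flights.type = c("HR20", "HR20", "L4", "L4", "L4", "HR20"),
  flights.duration = c("01:12:00", "02:00:00", "00:54:00",
                       "00:42:00", "00:22:00", "00:24:00"),
  stringsAsFactors = FALSE
)

# Attach an arbitrary date and measure the offset from midnight.
# This is only valid while every duration is under 24 hours.
midnight <- as.POSIXct("1970-01-01 00:00:00", tz = "GMT")
stamp <- as.POSIXct(paste("1970-01-01", mat$flights.duration), tz = "GMT")
mat$minutes <- as.numeric(difftime(stamp, midnight, units = "mins"))

# Sum the minutes per year and type with base aggregate()
totals <- aggregate(minutes ~ year + flights.type, data = mat, FUN = sum)
print(totals)
```

Working on a plain numeric column sidesteps the problem in the question, where aggregate() was handed formatted time strings that sum() cannot add.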

Thank you all for your valuable comments and support :) – richpiana

You can convert the flights.duration column to a numeric value in minutes as follows:

# Convert each "HH:MM:SS" string to minutes: hours * 60 + minutes + seconds / 60
df$flights.duration <- apply(df, 1, function(x) { 
           sum(as.numeric(unlist(strsplit(x[3], ':'))) * c(60, 1, 1/60)) 
         })

Then use a grouping function, for example the one from the dplyr package:

library(dplyr) 
df %>% group_by(year, flights.type) %>% summarise(flights.duration = sum(flights.duration))

The output will be as follows:

Source: local data frame [2 x 3] 
Groups: year [?] 

    year flights.type flights.duration 
    <int>  <chr>   <dbl> 
1 2000   HR20    216 
2 2000   L4    118
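The question's end goal is a stacked bar chart, which none of the steps so far draws. As a rough sketch, assuming the summarised totals shown above, base graphics can stack the types within each year. The object names totals and m are placeholders chosen for this example.

```r
# Summarised totals in minutes, copied from the output above
totals <- data.frame(
  year = c(2000L, 2000L),
  flights.type = c("HR20", "L4"),
  flights.duration = c(216, 118),
  stringsAsFactors = FALSE
)

# Reshape to a matrix with one row per type and one column per year
m <- xtabs(flights.duration ~ flights.type + year, data = totals)

# barplot() stacks the rows of a matrix within each column by default
barplot(m,
        legend.text = rownames(m),
        xlab = "Year",
        ylab = "Total flight time (minutes)")
```

With only the six sample rows there is a single bar for 2000; with the full data each year would get its own stacked bar.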

Edit:

library(tidyr) 
library(dplyr) 
df %>% 
    separate(flights.duration, c('hours', 'mins', 'seconds'), ':') %>% 
    group_by(year, flights.type) %>% 
    summarise(flights.duration = sum(60 * as.numeric(hours) + 
            as.numeric(mins) + 
            as.numeric(seconds)/60))

The result is the same as before:

Source: local data frame [2 x 3] 
Groups: year [?] 

    year flights.type flights.duration 
    <int>  <chr>   <dbl> 
1 2000   HR20    216 
2 2000   L4    118

I added this second option because using separate from the tidyr package may be faster than the apply call above, which loops over every row.

Source

2016-05-15 21:22:25 Gopala

A solution using the data.table package and the as.difftime() function:

library(data.table) 
setDT(mat)[, .(flights.duration.minutes = sum(as.difftime(as.character(flights.duration)))), 
       .(year, flights.type)] 

    year flights.type flights.duration.minutes 
1: 2000   HR20     216 mins 
2: 2000   L4     118 mins
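For readers unfamiliar with as.difftime(), here is a small base R sketch of what it does with these strings. Passing format and units explicitly is a choice made for this example, since the answer above relies on the defaults. Because %H only accepts hours 00 to 23, this route presumably breaks down for durations of 24 hours or more.

```r
durations <- c("01:12:00", "02:00:00", "00:24:00")

# Parse with an explicit format and force the result into minutes
d <- as.difftime(durations, format = "%H:%M:%S", units = "mins")

# sum() keeps the difftime class and its units
total <- sum(d)

# Convert to a plain number when the units label is not wanted
total_minutes <- as.numeric(total, units = "mins")
print(total_minutes)
```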

Source

2016-05-15 22:35:24 Psidom

The lubridate package is widely considered the best date/time package available in R. It builds on the base R Date and POSIXct types and adds its own Interval, Duration, and Period types.

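The most appropriate data type for hh:mm:ss times is the Period type. In theory, it should be possible to parse your string times into Period values and then use aggregate() to perform a straightforward grouped sum().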

Unfortunately, this turned out to be a much harder task than one would hope. I got it working in the end, but it took some contortions.

First, here is how to parse the string times into Period values. lubridate provides the convenient hms() function for this:

df <- data.frame(year=c(2000L,2000L,2000L,2000L,2000L,2000L),flights.type=c('HR20','HR20','L4','L4','L4','HR20'),flights.duration=c('01:12:00','02:00:00','00:54:00','00:42:00','00:22:00','00:24:00'),stringsAsFactors=F); 

library(lubridate); 
df$flights.duration <- hms(df$flights.duration); 

df; 
## year flights.type flights.duration 
## 1 2000   HR20  1H 12M 0S 
## 2 2000   HR20   2H 0M 0S 
## 3 2000   L4   54M 0S 
## 4 2000   L4   42M 0S 
## 5 2000   L4   22M 0S 
## 6 2000   HR20   24M 0S

Second, and unfortunately, lubridate does not seem to provide a sum() method for the Period type:

sum(df$flights.duration); 
## [1] 0

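(If you are wondering why it returns zero: the Period type is implemented by storing the seconds field as the payload of the vector, which is of type double, while the remaining fields (minutes, hours, days, months, years) are stored as slots, also of type double. All the values in df$flights.duration have zero seconds, and the base sum() function only sees the vector payload, so the total comes out as zero.)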
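The layout described above can be inspected directly. This is a minimal sketch, assuming lubridate is installed; the variable names are placeholders. It deliberately sums the bare payload rather than calling sum() on the Period itself, since that call may behave differently in other lubridate versions than the one used in this answer.

```r
library(lubridate)

p <- hms("01:12:00")

# Seconds live in the vector payload; the other fields are S4 slots
p@.Data    # seconds
p@minute   # minutes
p@hour     # hours

# Adding up only the payload, which is all base sum() sees, ignores
# the hours and minutes entirely
durations <- hms(c("01:12:00", "02:00:00", "00:24:00"))
payload_total <- sum(durations@.Data)
print(payload_total)
```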

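I tried to fill this gap myself with an S3 method, but quickly found that this would not work, because Period is an S4 type. So I wrote this S4 method instead: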

setMethod('sum',signature(x='Period',na.rm='logical'),function(x,na.rm=FALSE) period(
    seconds=sum(as.double(x),na.rm=na.rm),
    minutes=sum(x@minute,na.rm=na.rm),
    hours=sum(x@hour,na.rm=na.rm),
    days=sum(x@day,na.rm=na.rm),
    months=sum(x@month,na.rm=na.rm),
    years=sum(x@year,na.rm=na.rm))); 
## [1] "sum" 

sum(df$flights.duration); 
## [1] "3H 154M 0S"

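Unfortunately, there is still a problem: by default aggregate() tries to simplify the aggregation result, and this flattens the S4 results into a non-S4 object, losing the slots and corrupting the data: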

res <- aggregate(flights.duration~year+flights.type,df,sum); 
res; 
## Error in paste(x@year, "y ", x@month, "m ", x@day, "d ", x@hour, "H ", : 
## trying to get slot "year" from an object (class "Period") that is not an S4 object 
traceback(); 
## 8: paste(x@year, "y ", x@month, "m ", x@day, "d ", x@hour, "H ", 
##  x@minute, "M ", x@.Data, "S", sep = "") 
## 7: format.Period(x[[i]], ..., justify = justify) 
## 6: format(x[[i]], ..., justify = justify) 
## 5: format.data.frame(x, digits = digits, na.encode = FALSE) 
## 4: as.matrix(format.data.frame(x, digits = digits, na.encode = FALSE)) 
## 3: print.data.frame(list(year = c(2000L, 2000L), flights.type = c("HR20", 
## "L4"), flights.duration = c(0, 0))) 
## 2: print(list(year = c(2000L, 2000L), flights.type = c("HR20", "L4" 
## ), flights.duration = c(0, 0))) 
## 1: print(list(year = c(2000L, 2000L), flights.type = c("HR20", "L4" 
## ), flights.duration = c(0, 0))) 
res$flights.duration; 
## [1] 0 0 
## attr(,"class") 
## [1] "Period" 
## attr(,"class")attr(,"package") 
## [1] "lubridate" 
isS4(res$flights.duration); 
## [1] FALSE

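As you can see, the aggregate() call itself succeeds, but the resulting object is corrupted. The print.data.frame() method fails on that column because it happens to call format(), which dispatches to the S3 method format.Period(), a private method in the lubridate namespace. That method fails on the corrupted object.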

We can prevent the simplification:

res <- aggregate(flights.duration~year+flights.type,df,sum,simplify=F); 
res; 
## year flights.type flights.duration 
## 1 2000   HR20    0 
## 2 2000   L4    0 
res$flights.duration; 
## $`1` 
## [1] "3H 36M 0S" 
## 
## $`4` 
## [1] "118M 0S" 
##

So technically it works, but the column is now a list, which is not ideal. It also does not display well: when shown as part of the data.frame, we only see a zero.

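We can fix this by manually transforming the column to combine the list components. Unfortunately, the obvious approaches, unlist() or do.call(c,...), do not work: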

res <- transform(aggregate(flights.duration~year+flights.type,df,sum,simplify=F),flights.duration=do.call(c,flights.duration)); 
res; 
## year flights.type flights.duration 
## 1 2000   HR20    0 
## 2 2000   L4    0 
res$flights.duration; 
## [1] 0 0 
isS4(res$flights.duration); 
## [1] FALSE

The list of Period values is flattened to a plain vector, much like the effect of the simplification performed by aggregate().

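The problem seems to be the list names, which prevent the c() call from behaving as expected. We can work around this with unname(). So here is the final solution: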

res <- transform(aggregate(flights.duration~year+flights.type,df,sum,simplify=F),flights.duration=do.call(c,unname(flights.duration))); 
res; 
## year flights.type flights.duration 
## 1 2000   HR20  3H 36M 0S 
## 2 2000   L4   118M 0S

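So, although we got there in the end, I do not recommend this solution. There is too much complexity, there are gaps in functionality, and the different factions of the R ecosystem interact in uncoordinated ways.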
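For comparison, here is a simpler route that still uses lubridate for parsing. It is not part of the answer above and is offered only as a sketch: convert each Period to a plain number of seconds before aggregating, so that aggregate() never has to handle S4 objects. The column names seconds and minutes are illustrative.

```r
library(lubridate)

df <- data.frame(
  year = rep(2000L, 6),
  flights.type = c("HR20", "HR20", "L4", "L4", "L4", "HR20"),
  flights.duration = c("01:12:00", "02:00:00", "00:54:00",
                       "00:42:00", "00:22:00", "00:24:00"),
  stringsAsFactors = FALSE
)

# Parse to Period, then immediately reduce to plain numeric seconds
df$seconds <- period_to_seconds(hms(df$flights.duration))

# Ordinary numeric aggregation: no S4 slots to lose
res <- aggregate(seconds ~ year + flights.type, data = df, FUN = sum)
res$minutes <- res$seconds / 60
print(res)

# If a Period is wanted for display, convert the totals back afterwards
seconds_to_period(res$seconds)
```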

Source

2016-05-16 02:04:48 bgoldst
