Grouping by unique values with data.table

I have a data.table with more than 130,000 rows.

I want to group two columns, dates and progress, by a variable id and put the values for each id into a vector, so I used aggregate().

df_agr <- aggregate(cbind(progress, dates) ~ id, data = df_test, FUN = c)

However, it takes about 52 seconds to aggregate the data, and I lose the date format of the dates column.

An example of the data frame:

  id  dates progress 
1: 3505H6856 2003-07-10  yes 
2: 3505H6856 2003-08-21  yes 
3: 3505H6856 2003-09-04  yes 
4: 3505H6856 2003-10-16  yes 
5: 3505H67158 2003-01-14  yes 
6: 3505H67158 2003-02-18  yes 
7: 3505H67862 2003-03-06  yes 
8: 3505H62168 2003-04-24  no 
9: 3505H62168 2003-05-15  yes 
10: 3505H65277 2003-02-11  yes

The result I get:

  id progress  dates 
1 3505H62168  1, 2  5, 6 
2 3505H65277   2   2 
3 3505H67158  2, 2  1, 3 
4 3505H67862   2   4 
5 3505H6856 2, 2, 2, 2  7, 8, 9, 10

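I was surprised to see that everything was converted to integer, and that what look like independent vectors in each row are actually vectors stored in a list: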

'data.frame': 5 obs. of 3 variables: 
$ id  : chr "3505H62168" "3505H65277" "3505H67158" "3505H67862" ... 
$ progress:List of 5 
    ..$ 1: int 1 2 
    ..$ 2: int 2 
    ..$ 3: int 2 2 
    ..$ 4: int 2 
    ..$ 5: int 2 2 2 2 
$ dates :List of 5 
    ..$ 1: int 5 6 
    ..$ 2: int 2 
    ..$ 3: int 1 3 
    ..$ 4: int 4 
    ..$ 5: int 7 8 9 10

I tried to convert the values back to dates in the right format with:

lapply(df_agr$dates, function(x) as.Date(x, origin="1970-01-01"))

But I get:

$`1` 
[1] "1970-01-06" "1970-01-07" 

$`2` 
[1] "1970-01-03" 

$`3` 
[1] "1970-01-02" "1970-01-04" 

$`4` 
[1] "1970-01-05" 

$`5` 
[1] "1970-01-08" "1970-01-09" "1970-01-10" "1970-01-11"

So it seems that, contrary to what is written in the documentation, the origin is not "1970-01-01"; perhaps it is the lowest date in the data?

So my question is: how can I get the same result as aggregate() with data.table while keeping the date format?

In other words, how do I group by unique id with data.table? I tried:

setDT(df)[,list(col1 = c(progress), col2 = c(dates)), by = .(unique(id))]

But of course I got the following error:

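Error in `[.data.table`(df, list(col1 = c(progress), col2 = c(dates)), : The items in the 'by' or 'keyby' list are length (5). Each must be same length as rows in x or number of rows returned by i (10).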

Data:

structure(list(id = c("3505H6856", "3505H6856", "3505H6856", 
"3505H6856", "3505H67158", "3505H67158", "3505H67862", "3505H62168", 
"3505H62168", "3505H65277"), dates = structure(c(12243, 12285, 
12299, 12341, 12066, 12101, 12117, 12166, 12187, 12094), class = "Date"), 
    progress = c("yes", "yes", "yes", "yes", "yes", "yes", "yes", 
    "no", "yes", "yes")), .Names = c("id", "dates", "progress" 
), class = c("data.frame"), row.names = c(NA, -10L 
))

Source

2017-05-10 Omlere

Use `by = .(id)` instead of `by = .(unique(id))`. –

@ErdemAkkas Yes, but I want to group by unique id. – Omlere

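I think you can use paste0 as shown below. You need to convert the dates to character so that they are not converted to their numeric counterparts; running the query below without that conversion gives you values like 12166, 12187. In your query you also use c to combine the objects, but when grouping with .(id) in by we should use paste to combine them, in data.table as well. Note that if you leave out the collapse argument, the result will not have one row per unique id. I hope this is helpful. Thanks: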

df_agr <- aggregate(cbind(progress, as.character(dates)) ~ id, data = df, FUN = paste0) 

> df_agr 
      id   progress            V2 
1 3505H62168   no, yes       2003-04-24, 2003-05-15 
2 3505H65277    yes          2003-02-11 
3 3505H67158   yes, yes       2003-01-14, 2003-02-18 
4 3505H67862    yes          2003-03-06 
5 3505H6856 yes, yes, yes, yes 2003-07-10, 2003-08-21, 2003-09-04, 2003-10-16 
>

Using data.table:

setDT(df)[,.(paste0(progress,collapse=","), paste0(as.character(dates),collapse=",")), by = .(id)] 


      id    V1           V2 
1: 3505H6856 yes,yes,yes,yes 2003-07-10,2003-08-21,2003-09-04,2003-10-16 
2: 3505H67158   yes,yes      2003-01-14,2003-02-18 
3: 3505H67862    yes         2003-03-06 
4: 3505H62168   no,yes      2003-04-24,2003-05-15 
5: 3505H65277    yes         2003-02-11
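
As a side note on why `as.Date(x, origin = "1970-01-01")` gave dates in January 1970 in the question: the small integers do not look like day counts. A likely explanation (an assumption, not verified against the asker's R version) is that `cbind()` drops the `Date` class and produces a character matrix, and that on R versions before 4.0, where strings became factors by default, `aggregate()` then returned the factor level codes. The sketch below only reproduces those two steps by hand on the sample data; it does not call `aggregate()` itself:

```r
# Sample columns taken from the question's dput() output
dates <- structure(c(12243, 12285, 12299, 12341, 12066, 12101, 12117,
                     12166, 12187, 12094), class = "Date")
progress <- c("yes", "yes", "yes", "yes", "yes", "yes", "yes",
              "no", "yes", "yes")

# cbind() strips the Date class: the result is a character matrix
# holding the day counts as strings ("12243", ...), not dates
m <- cbind(progress, dates)
print(m[1, ])

# Turning those strings into a factor and taking the integer codes
# gives the position of each date among the sorted unique values
codes <- as.integer(factor(m[, "dates"]))
print(codes)
```

If that is what happened, the integers are ranks within the data rather than days since an origin, so no choice of `origin` recovers the real dates.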

Alternatively, as pointed out by David Arenburg, there is a simpler way in data.table (thanks for the valuable comment):

setDT(df)[, lapply(.SD, toString), by = id]
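
The solutions above return the dates as comma-separated strings. If the goal is to keep real `Date` vectors, one option is a list column, where each cell holds the vector of values for one `id`. This is a minimal sketch on the question's sample data and is not part of the original answer. Note that `by = id` already yields one row per distinct `id`, so `unique()` is not needed:

```r
library(data.table)

# Sample data rebuilt from the question's dput() output
df <- data.frame(
  id = c("3505H6856", "3505H6856", "3505H6856", "3505H6856",
         "3505H67158", "3505H67158", "3505H67862", "3505H62168",
         "3505H62168", "3505H65277"),
  dates = structure(c(12243, 12285, 12299, 12341, 12066, 12101,
                      12117, 12166, 12187, 12094), class = "Date"),
  progress = c("yes", "yes", "yes", "yes", "yes", "yes", "yes",
               "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df)

# Wrap each column in list() so every group contributes one list element
res <- dt[, .(progress = list(progress), dates = list(dates)), by = id]

print(res)
class(res$dates[[1]])  # the elements keep the Date class
res$dates[[1]]         # the dates for the first id
```

The trade-off is that list columns are less convenient to export or print than plain strings, so `toString` may still be the better choice when the result is only for display.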

Source

2017-05-10 10:55:47 PKumar

Thank you very much. It works perfectly, and it takes only about 2-3 seconds with 'data.table' instead of 52 seconds with 'aggregate'. – Omlere

@DavidArenburg Thanks for your feedback and comments; I have added them to the solution. – PKumar

A dplyr version.

library(dplyr) 
df %>% 
    group_by(id) %>% 
    summarize (progress = paste(progress, collapse=","), 
       dates = paste(dates, collapse=",")) 

#   id  progress          dates 
#  <chr>   <chr>          <chr> 
# 1 3505H62168   no,yes      2003-04-24,2003-05-15 
# 2 3505H65277    yes         2003-02-11 
# 3 3505H67158   yes,yes      2003-01-14,2003-02-18 
# 4 3505H67862    yes         2003-03-06 
# 5 3505H6856 yes,yes,yes,yes 2003-07-10,2003-08-21,2003-09-04,2003-10-16

Source

2017-05-10 11:37:40 epi99
