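dcast efficiently large datasets with multiple variables

I am trying to `dcast` a large dataset (millions of rows). I have one row with the arrival time and origin, and another row with the departure time and destination. In both cases there is an `id` that identifies the unit. It looks similar to this: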
id time movement origin dest
1 10/06/2011 15:54 ARR 15 15
1 10/06/2011 16:14 DEP 15 29
2 10/06/2011 17:59 ARR 73 73
2 10/06/2011 18:10 DEP 73 75
2 10/06/2011 21:10 ARR 75 75
2 10/06/2011 21:20 DEP 75 73
3 10/06/2011 17:14 ARR 17 17
3 10/06/2011 18:01 DEP 17 48
4 10/06/2011 17:14 ARR 49 49
4 10/06/2011 17:26 DEP 49 15
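So I would like to reshape each (ARR - DEP) pair into a single row, and do so efficiently (as here). Because this is a very large dataset, a `for` loop will not work in this case. The ideal output would be: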
index unitid origin arr time dest dep time
1 1 15 10/06/2011 14:33 29 10/06/2011 19:24
2 2 73 10/06/2011 14:59 75 10/06/2011 17:23
3 2 75 10/06/2011 21:10 73 10/06/2011 23:40
Data:
df <- structure(list(time = structure(c(7L, 16L, 8L, 11L, 18L, 20L,
10L, 12L, 3L, 6L, 15L, 19L, 9L, 4L, 5L, 14L, 1L, 2L, 13L, 17L
), .Label = c("10/06/2011 09:08", "10/06/2011 10:54", "10/06/2011 11:38",
"10/06/2011 12:41", "10/06/2011 12:54", "10/06/2011 14:26", "10/06/2011 14:33",
"10/06/2011 14:59", "10/06/2011 17:12", "10/06/2011 17:14", "10/06/2011 17:23",
"10/06/2011 18:56", "10/06/2011 19:03", "10/06/2011 19:04", "10/06/2011 19:16",
"10/06/2011 19:24", "10/06/2011 20:12", "10/06/2011 21:10", "10/06/2011 22:28",
"10/06/2011 23:40"), class = "factor"), movement = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 1L,
2L, 2L, 3L), .Label = c("ARR", "DEP", "ITZ"), class = "factor"),
origin = c(15L, 15L, 73L, 73L, 75L, 75L, 17L, 17L, 49L, 49L,
15L, 15L, 32L, 10L, 10L, 17L, 76L, 76L, 76L, 76L), dest = c(15L,
29L, 73L, 75L, 75L, 73L, 17L, 48L, 49L, 15L, 15L, 49L, 32L,
10L, 17L, 10L, 76L, 65L, 76L, 65L), id = c(1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 6L, 6L, 6L, 7L, 7L, 8L,
8L)), .Names = c("time", "movement", "origin", "dest", "id"
), row.names = c(NA, -20L), class = "data.frame")
Perhaps you can try `dcast.data.table` from the [1.9.8 branch of "data.table"](https://github.com/Rdatatable/data.table/tree/1.9.8) (but expect that things might change, since that is not the CRAN version yet). – A5C1D2H2I1M1N2O1R2T1 2014-12-19 10:45:41
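Hi @AnandaMahto, if I only want to pick up the time, this code (by @akron) does it: `dcast.data.table(setDT(df)[, c('.id', 'Seq') := list(c('arrival', 'departure'), gl(.N, 2, .N))], id + Seq ~ .id, value.var = 'time')`. However, if I want to add the origin and destination information, I really don't know how to pick it up. Bear in mind that it is a very large dataset (millions of rows). – user3507584 2014-12-19 11:37:32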
Could you tell us how many rows you are dealing with? – Arun 2014-12-19 12:34:02
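For what it is worth, here is a minimal sketch of one way the origin and destination could be carried along with the times. It is not the code from the comments above. It assumes a data.table release in which `dcast` accepts several `value.var` columns (1.9.6 or later, as far as I know), that rows are already in chronological order within each `id`, and that every ARR row is followed by at most one DEP row. The `pair` helper column and the output column names are made up for illustration, and the sample values simply mirror the layout shown in the question.

```r
library(data.table)

# Illustrative sample in the same layout as the question
dt <- data.table(
  id       = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L),
  time     = c("10/06/2011 15:54", "10/06/2011 16:14", "10/06/2011 17:59",
               "10/06/2011 18:10", "10/06/2011 21:10", "10/06/2011 21:20",
               "10/06/2011 17:14", "10/06/2011 18:01", "10/06/2011 17:14",
               "10/06/2011 17:26"),
  movement = rep(c("ARR", "DEP"), 5),
  origin   = c(15L, 15L, 73L, 73L, 75L, 75L, 17L, 17L, 49L, 49L),
  dest     = c(15L, 29L, 73L, 75L, 75L, 73L, 17L, 48L, 49L, 15L)
)

# Drop any other movement types (assumption: only ARR/DEP rows matter here)
dt <- dt[movement %in% c("ARR", "DEP")]

# Hypothetical helper column: number each ARR within a unit, so that the
# DEP row following it shares the same pair number
dt[, pair := cumsum(movement == "ARR"), by = id]

# Cast several value columns at once; this yields columns such as
# time_ARR, time_DEP, origin_ARR, origin_DEP, dest_ARR, dest_DEP
wide <- dcast(dt, id + pair ~ movement,
              value.var = c("time", "origin", "dest"))

# Keep the columns from the desired output (names are illustrative)
res <- wide[, .(unitid   = id,
                origin   = origin_ARR,
                arr_time = time_ARR,
                dest     = dest_DEP,
                dep_time = time_DEP)]
print(res)
```

If a unit has a DEP without a preceding ARR, or two DEP rows after one ARR, the `pair` numbering no longer identifies unique cells and `dcast` would fall back to an aggregate, so such rows would need to be handled before casting.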