Another way:
# assuming the data.frame is already sorted by
# all three columns (unfortunately, this is a requirement)
> sequence(rle(do.call(paste, df))$lengths)
# [1] 1 1 1 2 3 1 1 1 1 2
Breakdown:
do.call(paste, df) # pastes each row of df together with default separator "space"
# [1] "A M 32" "A M 33" "A F 35" "A F 35" "A F 35" "A F 39" "B M 30" "B F 25" "B F 28"
# [10] "B F 28"
rle(.) # gets the run length vector
# Run Length Encoding
# lengths: int [1:7] 1 1 3 1 1 1 2
# values : chr [1:7] "A M 32" "A M 33" "A F 35" "A F 39" "B M 30" "B F 25" "B F 28"
$lengths # get the run-lengths (as opposed to values)
# [1] 1 1 3 1 1 1 2
sequence(.) # get 1:n for each n
# [1] 1 1 1 2 3 1 1 1 1 2
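To make the sorting requirement concrete, here is a small sketch that rebuilds the ten rows shown in the breakdown and runs the one-liner on them. The column names `Place`, `Sex` and `Length` are assumed for this example, and `df_unsorted` is a hypothetical reordering added only to show what goes wrong when identical rows are not adjacent.

```r
# Example data reconstructed from the pasted rows shown above.
# The column names Place, Sex and Length are assumptions.
df <- data.frame(
  Place  = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Sex    = c("M", "M", "F", "F", "F", "F", "M", "F", "F", "F"),
  Length = c(32, 33, 35, 35, 35, 39, 30, 25, 28, 28)
)

# Sorted input: identical rows are adjacent, so each run is one group.
id_sorted <- sequence(rle(do.call(paste, df))$lengths)
id_sorted
# [1] 1 1 1 2 3 1 1 1 1 2

# Hypothetical unsorted input: move one "A F 35" row to the end.
df_unsorted <- df[c(1:3, 5:10, 4), ]
id_unsorted <- sequence(rle(do.call(paste, df_unsorted))$lengths)
id_unsorted
# [1] 1 1 1 2 1 1 1 1 2 1
```

In the unsorted case the last row starts a new run, so it gets ID 1 instead of 3. That is why the data.frame has to be ordered on all columns first.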
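Benchmarking:
Since there are quite a few solutions, I thought I'd benchmark them on a relatively large data.frame. So, here are the results (I've also added a solution using data.table).
Here's the data: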
require(data.table)
require(plyr)
set.seed(45)
length <- 1e3 # number of rows in `df`
df <- data.frame(Place = sample(letters[1:20], length, replace = TRUE),
                 Sex = sample(c("M", "F"), length, replace = TRUE),
                 Length = sample(1:75, length, replace = TRUE))
df <- df[with(df, order(Place, Sex, Length)), ]
Ananda's ave solution:
AVE_FUN <- function(x) {
  i <- interaction(x)
  within(x, {
    ID <- ave(as.character(i), i, FUN = seq_along)
  })
}
Arun's rle solution:
RLE_FUN <- function(x) {
  # paste the rows of the argument x (not the global df) and number each run
  transform(x, ID = sequence(rle(do.call(paste, x))$lengths))
}
Ben's plyr solution:
PLYR_FUN <- function(x) {
  ddply(x, c("Place", "Sex", "Length"), transform, ID = seq_along(Length))
}
Finally, the data.table solution:
DT_FUN <- function(x) {
  dt <- data.table(x)
  dt[, ID := seq_along(.I), by = names(dt)]
}
Benchmarking code:
require(rbenchmark)
benchmark(d1 <- AVE_FUN(df),
          d2 <- RLE_FUN(df),
          d3 <- PLYR_FUN(df),
          d4 <- DT_FUN(df),
          replications = 5, order = "elapsed")
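Before reading the timings, it can be worth confirming that the solutions actually produce the same IDs. The following is a minimal base-R sketch of such a check for the ave and rle functions only (so it needs neither plyr nor data.table). The variable `n` and the name `ids_agree` are introduced here for illustration; the functions are repeated so that the snippet runs on its own.

```r
set.seed(45)
n <- 1e3  # number of rows, matching the smallest benchmark size
df <- data.frame(Place  = sample(letters[1:20], n, replace = TRUE),
                 Sex    = sample(c("M", "F"), n, replace = TRUE),
                 Length = sample(1:75, n, replace = TRUE))
df <- df[with(df, order(Place, Sex, Length)), ]

AVE_FUN <- function(x) {
  i <- interaction(x)
  within(x, {
    ID <- ave(as.character(i), i, FUN = seq_along)
  })
}

RLE_FUN <- function(x) {
  transform(x, ID = sequence(rle(do.call(paste, x))$lengths))
}

d1 <- AVE_FUN(df)
d2 <- RLE_FUN(df)

# ave() returns character IDs here, so convert before comparing.
ids_agree <- all(as.integer(d1$ID) == d2$ID)
ids_agree  # should be TRUE when df is sorted on all three columns
```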
Results:
With length = 1e3 (the number of rows in the data.frame df):
# test replications elapsed relative user.self
# 2 d2 <- RLE_FUN(df) 5 0.013 1.000 0.013
# 4 d4 <- DT_FUN(df) 5 0.017 1.308 0.016
# 1 d1 <- AVE_FUN(df) 5 0.052 4.000 0.052
# 3 d3 <- PLYR_FUN(df) 5 4.629 356.077 4.452
With length = 1e4:
# test replications elapsed relative user.self
# 4 d4 <- DT_FUN(df) 5 0.033 1.000 0.031
# 2 d2 <- RLE_FUN(df) 5 0.089 2.697 0.088
# 1 d1 <- AVE_FUN(df) 5 0.102 3.091 0.100
# 3 d3 <- PLYR_FUN(df) 5 23.103 700.091 20.659
With length = 1e5:
# test replications elapsed relative user.self
# 4 d4 <- DT_FUN(df) 5 0.179 1.000 0.130
# 1 d1 <- AVE_FUN(df) 5 1.001 5.592 0.940
# 2 d2 <- RLE_FUN(df) 5 1.098 6.134 1.011
# 3 d3 <- PLYR_FUN(df) 5 219.861 1228.274 147.545
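Observations: The trend I notice is that, as the data gets bigger, data.table (not surprisingly) does best (it scales really well), while ave and rle are close competitors for second place (ave scales better than rle: rle is faster at 1e3 and 1e4 rows, but ave edges ahead at 1e5). Unfortunately, plyr performs very poorly on all of the datasets.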
Note: Ananda's solution gives character output, and I've kept it that way in the benchmark.
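If integer IDs are preferred, one possible variant is to pass an integer vector to ave instead of the character version of the interaction. This is only a sketch: `AVE_INT_FUN` is a hypothetical name, it was not part of the benchmark above, and the small data.frame is made up for the example.

```r
# Hypothetical integer-returning variant of the ave approach
# (not one of the benchmarked functions).
AVE_INT_FUN <- function(x) {
  i <- interaction(x, drop = TRUE)  # one level per distinct row
  x$ID <- ave(seq_len(nrow(x)), i, FUN = seq_along)
  x
}

# Small made-up example with the same column names as the benchmark data.
df <- data.frame(Place  = c("A", "A", "B", "B", "B"),
                 Sex    = c("F", "F", "M", "M", "M"),
                 Length = c(35, 35, 30, 30, 31))

res <- AVE_INT_FUN(df)
res$ID
# [1] 1 2 1 2 1
is.integer(res$ID)
# [1] TRUE
```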
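Don't include the creation of the data.table in the function. `dt[, ID := seq_len(.N), by = names(dt)]` might be faster as well. – mnel 2013-03-07 22:48:40
@mnel, I remember MatthewDowle mentioning something similar about including `, key(.)` (http://stackoverflow.com/questions/15182888/complicated-reshaping). I thought it should include creating the data.table, based on the comments on Ricardo's benchmark. – Arun 2013-03-07 22:55:49
If you want to take advantage of the ordering and the speed-up from setting the key, then include setkey(), but you are not doing that in this case, so I think it is unfair overhead (as opposed to fair overhead if you were setting the key). – mnel 2013-03-07 22:59:18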