There is a simpler and more efficient way of doing this, using the duplicated() function instead of counting group sizes.
First, we need to generate a test dataset:
# Generate test datasets
library(data.table)
set.seed(1)  # for reproducible sampling
smallNumberSampled <- 1e3
largeNumberSampled <- 1e6
smallDataset <- data.table(id = paste('id', 1:smallNumberSampled, sep = '_'),
                           value1 = sample(x = 1:26, size = smallNumberSampled, replace = TRUE),
                           value2 = letters[sample(x = 1:26, size = smallNumberSampled, replace = TRUE)])
largeDataset <- data.table(id = paste('id', 1:largeNumberSampled, sep = '_'),
                           value1 = sample(x = 1:26, size = largeNumberSampled, replace = TRUE),
                           value2 = letters[sample(x = 1:26, size = largeNumberSampled, replace = TRUE)])
# Add 2% duplicated rows:
smallDataset <- rbind(smallDataset,
                      smallDataset[sample(x = 1:nrow(smallDataset), size = nrow(smallDataset) * 0.02)])
largeDataset <- rbind(largeDataset,
                      largeDataset[sample(x = 1:nrow(largeDataset), size = nrow(largeDataset) * 0.02)])
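As a quick sanity check (deterministic given the sizes above, since rbind appends exactly 2% extra rows):
> nrow(smallDataset)
[1] 1020
> nrow(largeDataset)
[1] 1020000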
Then we implement the three candidate solutions as functions:
# Original suggestion: count group sizes with .N
# (note: := adds the helper column n to dt by reference)
getDuplicatedRows_Count <- function(dt, columnName) {
  dt[, n := .N, by = columnName]
  return(dt[n > 1])
}

# duplicated() on a subset of the columns
getDuplicatedRows_duplicated_subset <- function(dt, columnName) {
  return(dt[which(duplicated(dt[, columnName, with = FALSE]) |
                  duplicated(dt[, columnName, with = FALSE], fromLast = TRUE)), ])
}

# duplicated() with its "by" argument, to avoid copying the key columns
getDuplicatedRows_duplicated_by <- function(dt, columnName) {
  return(dt[which(duplicated(dt, by = columnName) |
                  duplicated(dt, by = columnName, fromLast = TRUE)), ])
}
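One caveat: the count-based function modifies its input, because := adds the n column to dt by reference. A minimal side-effect-free sketch using data.table's copy() (the name getDuplicatedRows_Count_copy is mine, not from the original answer):

# Same counting logic, but on a deep copy so the caller's table is untouched
getDuplicatedRows_Count_copy <- function(dt, columnName) {
  dt <- copy(dt)                 # copy() prevents := from reaching the original
  dt[, n := .N, by = columnName]
  return(dt[n > 1])
}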
Then we test that they give identical results (this works because getDuplicatedRows_Count adds the n column to smallDataset by reference, so the two later results include that column as well):
results1 <- getDuplicatedRows_Count(smallDataset, 'id')
results2 <- getDuplicatedRows_duplicated_subset(smallDataset, 'id')
results3 <- getDuplicatedRows_duplicated_by(smallDataset, 'id')
> identical(results1, results2)
[1] TRUE
> identical(results2, results3)
[1] TRUE
And we time the average performance of the three solutions:
# Small dataset:
> system.time(temp <- replicate(n = 100, expr = getDuplicatedRows_Count(smallDataset, 'id'))) / 100
   user  system elapsed 
0.00176 0.00007 0.00186 
> system.time(temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_subset(smallDataset, 'id'))) / 100
   user  system elapsed 
0.00206 0.00005 0.00221 
> system.time(temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_by(smallDataset, 'id'))) / 100
   user  system elapsed 
0.00141 0.00003 0.00147 

# Large dataset:
> system.time(temp <- replicate(n = 100, expr = getDuplicatedRows_Count(largeDataset, 'id'))) / 100
   user  system elapsed 
0.28571 0.01980 0.31022 
> system.time(temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_subset(largeDataset, 'id'))) / 100
   user  system elapsed 
0.24386 0.03596 0.28243 
> system.time(temp <- replicate(n = 100, expr = getDuplicatedRows_duplicated_by(largeDataset, 'id'))) / 100
   user  system elapsed 
0.22080 0.03918 0.26203 
This shows that the duplicated() approach scales better, especially when the "by=" option is used.
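For more robust timings than the system.time()/replicate() pattern above, the microbenchmark package could be used along these lines (a sketch, not part of the original benchmark; note that the count-based function adds an n column to its input, so a fresh table per run would be cleaner):

library(microbenchmark)
microbenchmark(
  count  = getDuplicatedRows_Count(largeDataset, 'id'),
  subset = getDuplicatedRows_duplicated_subset(largeDataset, 'id'),
  by     = getDuplicatedRows_duplicated_by(largeDataset, 'id'),
  times  = 100
)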
UPDATE, November 21st 2014: Testing for identical output (as suggested by Arun, thanks) uncovered a problem with data.table v1.9.2, where duplicated's fromLast argument does not work. I updated to v1.9.4 and redid the analysis; the differences are now much smaller.
UPDATE, November 26th 2014: Included and tested the "by=" approach, which avoids extracting columns from the data.table (as suggested by Arun, so credit goes there). The timings are also now averaged over 100 runs to ensure robust results.
Sure thing! Cheers. – fridaymeetssunday
Or, if you don't want to assign a new column: 'exons.s[, c(.SD, n = .N), by = newID][n > 1]' –
As far as simple syntax goes, I think this is the most natural version: 'exons.s[, .SD[.N > 1], by = newID]' – eddi
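For reference, the one-liners from the comments map onto the test data above like this (a sketch; newID in the comments corresponds to the id column here, and both assume the table does not already carry the n column that getDuplicatedRows_Count adds by reference):

# eddi's version: keep every group with more than one row
smallDataset[, .SD[.N > 1], by = id]
# the no-new-column version: tack the count onto .SD, then filter on it
smallDataset[, c(.SD, n = .N), by = id][n > 1]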