How can I efficiently sample a data frame by group?

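I have a problem similar to the one explained in this question. As in that question, I have a data frame with 3 columns (id, group, value). I want to draw n samples with replacement from each group and produce a smaller data frame that has n samples per group.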

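However, my simulation code draws hundreds of subsamples, and the ddply-based solution is very slow when used there. I tried writing a simple piece of code myself to see whether I could get better performance, but it is still slow (no better than the ddply solution, if not worse). My code is below. I would like to know whether its performance can be improved.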

#Producing example DataFrame 
dfsize <- 10 
groupsize <- 7 
test.frame.1 <- data.frame(id = 1:dfsize, group = rep(1:groupsize,each = ceiling(dfsize/groupsize))[1:dfsize], junkdata = sample(1:10000, size =dfsize)) 


#Main function for subsampling
sample.from.group <- function(df, dfgroup, size, replace) {
    outputsize <- 1
    newdf <- df # assuming a sample cannot be larger than the original
    uniquegroups <- unique(dfgroup)
    for (uniquegroup in uniquegroups) {
        dataforgroup <- which(dfgroup == uniquegroup)
        mysubsample <- df[sample(dataforgroup, size, replace), ]
        sizeofsample <- nrow(mysubsample)
        newdf[outputsize:(outputsize + sizeofsample - 1), ] <- mysubsample
        outputsize <- outputsize + sizeofsample
    }
    return(newdf[1:(outputsize - 1), ])
}

#Using the function 
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)


2013-06-04 Mark

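Make df a matrix. Matrix subsetting is much faster than data frame subsetting. Also, data.table is usually faster than plyr (by orders of magnitude, depending on the size of your data). – Roland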
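As a rough illustration of the data.table suggestion in the comment above, per-group sampling with replacement might look like the sketch below. It assumes the data.table package is installed; the example data and the sample size of 3 are made up for illustration and are not from the question.

```r
library(data.table)  # assumption: the data.table package is installed

set.seed(1)
# Small illustrative table with the same three columns as the question.
dt <- data.table(
  id = 1:10,
  group = rep(1:5, each = 2),
  junkdata = sample(1:10000, size = 10)
)

# Within each group, draw 3 row positions with replacement from 1..(.N)
# and return the matching rows of that group's subset (.SD).
out <- dt[, .SD[sample.int(.N, size = 3, replace = TRUE)], by = group]
out
```

Whether this beats the other approaches in the thread depends on the data size, so it is worth benchmarking on your own data.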

Here are two plyr-based solutions:

library(plyr) 

dfsize <- 1e4 
groupsize <- 7 
testdf <- data.frame(
    id = seq_len(dfsize), 
    group = rep(1:groupsize, length = dfsize), 
    junkdata = sample(1:10000, size = dfsize)) 

sample_by_group_1 <- function(df, dfgroup, size, replace) {
    ddply(df, dfgroup, function(x) {
        # Sample row positions within the group's own piece x, not the full df
        x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
    })
}

sample_by_group_2 <- function(df, dfgroup, size, replace) { 
    idx <- split_indices(df[[dfgroup]]) 
    subs <- lapply(idx, sample, size = size, replace = replace) 

    df[unlist(subs, use.names = FALSE), , drop = FALSE] 
} 

library(microbenchmark) 
microbenchmark(
    ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE), 
    plyr = sample_by_group_2(testdf, "group", 100, replace = TRUE) 
) 

# Unit: microseconds 
# expr min lq median uq max neval 
# ddply 4488 4723 5059 5360 36606 100 
# plyr 443 487 507 536 31343 100

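The second method is much faster because it does the subsetting in a single step. If you can find a way to do an operation in one step, that is usually an easy way to get better performance.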
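The same single-subset idea can be sketched in base R without plyr. The helper names `resample` and `sample_by_group_base` are hypothetical, and the example data is made up. The `resample` wrapper is there because `sample(x, ...)` treats a length-one numeric `x` as `1:x`, which would give wrong rows for a group that has only one row.

```r
# Hypothetical helper: sample from the elements of x, even when length(x) == 1.
resample <- function(x, size, replace) {
  x[sample.int(length(x), size = size, replace = replace)]
}

# Hypothetical base R counterpart of the one-step approach.
sample_by_group_base <- function(df, dfgroup, size, replace = TRUE) {
  # Row numbers for each group, as a list of integer vectors.
  idx <- split(seq_len(nrow(df)), df[[dfgroup]])
  # Draw row numbers within every group.
  subs <- lapply(idx, resample, size = size, replace = replace)
  # Subset the data frame once with all drawn row numbers.
  df[unlist(subs, use.names = FALSE), , drop = FALSE]
}

set.seed(42)
testdf_small <- data.frame(
  id = 1:10,
  group = rep(1:5, each = 2),
  junkdata = sample(1:10000, size = 10)
)
out <- sample_by_group_base(testdf_small, "group", size = 3)
table(out$group)  # three rows drawn from each of the five groups
```

Only row numbers are handled inside the loop over groups, so the comparatively expensive data frame subsetting happens exactly once.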


2013-06-04 13:10:43 hadley

I think this is cleaner and possibly faster:

z <- sapply(unique(test.frame.1$group), FUN= function(x){ 
      sample(which(test.frame.1$group==x), 100, TRUE) 
      }) 
out <- test.frame.1[z,] 
out


2013-06-04 08:32:38 Thomas

Thanks a lot. It is faster than my code, but it does not seem to be faster than plyr. – Mark
