Generating sets for cross-validation

How can I automatically split a matrix for 5-fold cross-validation in R? I actually want to generate 5 sets of (test_matrix_indices, train_matrix_indices).

2011-09-13 Delphine

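Please don't mix answers into your question. It gets confusing. If you want to answer your own question, please do so in a new answer. – Andrie

For K-fold cross-validation you have to merge K-1 subsets into the training set and leave one out as the test set (repeated K times), so this is not a complete solution to your problem. –

I have moved my answer into the answers section. – Delphine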

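f_K_fold <- function(Nobs, K = 5) { 
    # Random permutation of the observation indices 1..Nobs 
    rs <- runif(Nobs) 
    id <- seq(Nobs)[order(rs)] 
    # Cut points that divide the permutation into K consecutive blocks 
    k <- as.integer(Nobs * seq(1, K - 1) / K) 
    # One row per fold: column 1 = first position, column 2 = last position 
    k <- matrix(c(0, rep(k, each = 2), Nobs), ncol = 2, byrow = TRUE) 
    k[, 1] <- k[, 1] + 1 
    # d is the shuffled index vector (id); block x is the test set, the rest is training 
    l <- lapply(seq.int(K), function(x, k, d) 
        list(train = d[!(seq(d) %in% seq(k[x, 1], k[x, 2]))], 
             test = d[seq(k[x, 1], k[x, 2])]), k = k, d = id) 
    return(l) 
}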
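As a quick usage sketch, the function above can be called and its output sanity-checked as follows. The function is repeated so the block runs on its own; the seed (42) and the number of observations (23, deliberately not a multiple of 5) are arbitrary choices for illustration and are not part of the original answer.

```r
# The fold generator from the answer above, repeated so this block runs on its own.
f_K_fold <- function(Nobs, K = 5) {
    rs <- runif(Nobs)
    id <- seq(Nobs)[order(rs)]
    k <- as.integer(Nobs * seq(1, K - 1) / K)
    k <- matrix(c(0, rep(k, each = 2), Nobs), ncol = 2, byrow = TRUE)
    k[, 1] <- k[, 1] + 1
    lapply(seq.int(K), function(x, k, d)
        list(train = d[!(seq(d) %in% seq(k[x, 1], k[x, 2]))],
             test = d[seq(k[x, 1], k[x, 2])]), k = k, d = id)
}

set.seed(42)                     # arbitrary seed, only for reproducibility
folds <- f_K_fold(23, K = 5)     # 23 observations, 5 folds

# Size of each test set
test_sizes <- sapply(folds, function(f) length(f$test))

# Every observation should appear in exactly one test set
all_test <- sort(unlist(lapply(folds, function(f) f$test)))
each_tested_once <- identical(all_test, seq_len(23))

# Within a fold, train and test must not overlap
no_overlap <- all(sapply(folds, function(f) length(intersect(f$train, f$test)) == 0))

# Resetting the seed reproduces exactly the same folds
set.seed(42)
reproducible <- identical(folds, f_K_fold(23, K = 5))
```

With 23 observations the test sets have sizes 4, 5, 4, 5 and 5, so uneven sample sizes are handled without losing any observation.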

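Source

2011-09-13 15:08:57

This is an elegant solution. Thank you. – Delphine

Also, this solution can be made reproducible by adding set.seed(n). – Delphine

What is d? I didn't get it. – LoveMeow

I think you want the rows of the matrix to be the cases that get split. Then all you need is sample and split: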

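X <- matrix(rnorm(1000), ncol = 5) 
id <- sample(1:5, nrow(X), replace = TRUE) 
# split() applied to a matrix splits its individual cells, so split the row numbers instead 
ListX <- lapply(split(seq_len(nrow(X)), id), function(i) X[i, , drop = FALSE]) # gives you a list with the 5 matrices 
X[id == 2, ] # gives you the second matrix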

I would work with a list, because it lets you do things like this:

names(ListX) <- c("Train1","Train2","Train3","Test1","Test2") 
mean(ListX$Train3)

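This makes the code easier to read and keeps you from creating lots of matrices in your workspace. If you keep the matrices as separate objects in the workspace, you are bound to mess up. Use lists!

If you want the test matrix to be smaller or larger than the others, use the prob argument of sample: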

id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))

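gives you a test matrix that is, on average, twice the size of the train matrices.

If you want to fix the exact number of cases, sample with prob is not the best option. You can use a trick like this: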

indices <- rep(1:5,c(100,20,20,20,40)) 
id <- sample(indices)

to get matrices with exactly 100, 20, ... and 40 cases, respectively.
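The trick can be checked with a short sketch. It assumes X has 200 rows, as in the earlier example, so that the group sizes sum to nrow(X); the seed is arbitrary and the row-wise splitting via lapply is one possible way to build the list, not necessarily what the answer intended.

```r
set.seed(1)                                  # arbitrary seed
X <- matrix(rnorm(1000), ncol = 5)           # 200 rows, 5 columns

sizes <- c(100, 20, 20, 20, 40)              # exact number of cases per group
stopifnot(sum(sizes) == nrow(X))             # the sizes must cover every row

# Build the group labels with the exact counts, then shuffle them
indices <- rep(1:5, sizes)
id <- sample(indices)

# Split the row numbers by label and extract the matching rows
ListX <- lapply(split(seq_len(nrow(X)), id), function(i) X[i, , drop = FALSE])

group_rows <- sapply(ListX, nrow)            # number of rows in each matrix
```

Because sample() without further arguments only permutes the labels, the group sizes are exact rather than merely expected.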

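Source

2011-09-13 13:23:05

+1 for split - I had been wondering how to generate cross-validation matrices myself, and this is perfect. – richiemorrisroe

joris, great code, thanks. Isn't the idea of cross-validation that you loop over all the sets and use each group as test data at least once, which would defeat the purpose of using a list and naming it the way you did? – appleLover

@appleLover The list is used only to avoid generating individual matrices in the workspace. It is there to keep everything together. There are multiple methods for cross-validation and bootstrapping, and depending on the method you need different corrections for your statistics. I just gave one way to create these matrices in an organized manner. –

A solution without split: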

set.seed(7402313) 
X <- matrix(rnorm(999), ncol=3) 
k <- 5 # number of folds 

# Generating random indices 
id <- sample(rep(seq_len(k), length.out=nrow(X))) 
table(id) 
# 1 2 3 4 5 
# 67 67 67 66 66 

# lapply over them: 
indicies <- lapply(seq_len(k), function(a) list(
    test_matrix_indices = which(id==a), 
    train_matrix_indices = which(id!=a) 
)) 
str(indicies) 
# List of 5 
# $ :List of 2 
# ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ... 
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ... 
# $ :List of 2 
# ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ... 
# ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ... 
# $ :List of 2 
# ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ... 
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ... 
# $ :List of 2 
# ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ... 
# ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ... 
# $ :List of 2 
# ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ... 
# ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...

But you could return the matrices too:

matrices <- lapply(seq_len(k), function(a) list(
    test_matrix = X[id==a, ], 
    train_matrix = X[id!=a, ] 
)) 
str(matrices) 
# List of 5 
# $ :List of 2 
# ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ... 
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ... 
# $ :List of 2 
# ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ... 
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ... 
# $ :List of 2 
# ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ... 
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ... 
# $ :List of 2 
# ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ... 
# ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ... 
# $ :List of 2 
# ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ... 
# ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...

Then you could use lapply to get the results:

lapply(matrices, function(x) { 
    m <- build_model(x$train_matrix) 
    performance(m, x$test_matrix) 
})
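In the snippet above, build_model and performance are placeholders. The following is a minimal runnable sketch of the same loop, assuming a linear regression of the first column on the other two as the model and the root mean squared error on the test fold as the performance measure. Both choices, and the column names y, x1 and x2, are hypothetical stand-ins for illustration.

```r
set.seed(7402313)
# Same simulated data as above, with column names added so a formula can be used
X <- matrix(rnorm(999), ncol = 3, dimnames = list(NULL, c("y", "x1", "x2")))
k <- 5                                                 # number of folds

id <- sample(rep(seq_len(k), length.out = nrow(X)))    # random fold label per row
matrices <- lapply(seq_len(k), function(a) list(
    test_matrix  = X[id == a, , drop = FALSE],
    train_matrix = X[id != a, , drop = FALSE]
))

# Hypothetical stand-in for build_model: ordinary least squares on the training fold
build_model <- function(train) lm(y ~ x1 + x2, data = as.data.frame(train))

# Hypothetical stand-in for performance: RMSE of the predictions on the test fold
performance <- function(m, test) {
    test <- as.data.frame(test)
    sqrt(mean((test$y - predict(m, newdata = test))^2))
}

# One performance value per fold, then the cross-validated average
rmse <- sapply(matrices, function(x) {
    m <- build_model(x$train_matrix)
    performance(m, x$test_matrix)
})
cv_rmse <- mean(rmse)
```

Using sapply instead of lapply here simply collects the per-fold values into a numeric vector that can be averaged directly.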

EDIT: Compared with Wojciech's solution:

f_K_fold <- function(Nobs, K=5){ 
    id <- sample(rep(seq.int(K), length.out=Nobs)) 
    l <- lapply(seq.int(K), function(x) list(
     train = which(x!=id), 
     test = which(x==id) 
    )) 
    return(l) 
}

Source

2011-09-13 14:13:43 Marek

EDIT: Thank you for your answers. I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf):

n <- nrow(mydata) 
K <- 5 
size <- n %/% K 
set.seed(5) 
rdm <- runif(n) 
ranked <- rank(rdm) 
block <- (ranked-1) %/% size+1 
block <- as.factor(block)

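Then I use:

for (k in 1:K) { 
    matrix_train <- mydata[block != k, ] 
    matrix_test <- mydata[block == k, ] 
    # [Algorithm sequence] 
}

to generate the appropriate sets for each iteration.

However, this solution can leave some individuals out of testing: when n is not a multiple of K, the leftover rows fall into an extra block that is never used as a test set. I do not recommend it.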
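The problem can be made visible with a small sketch. It assumes n = 333 rows (an arbitrary value that is not a multiple of K) and uses ties.method = "first" so the ranks are guaranteed to be distinct integers. The alternative assignment at the end is one possible fix, not part of the referenced document.

```r
n <- 333                      # assumed number of rows, not a multiple of K
K <- 5
size <- n %/% K               # 66

set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm, ties.method = "first")   # a random permutation of 1..n

# Block assignment as in the answer above
block <- (ranked - 1) %/% size + 1
n_blocks <- length(unique(block))            # K + 1 blocks appear
never_tested <- sum(block > K)               # rows that no fold ever tests

# One possible alternative: assign blocks by the remainder of the rank,
# so every row lands in one of exactly K blocks of nearly equal size
block_alt <- (ranked - 1) %% K + 1
alt_sizes <- as.integer(table(block_alt))
```

With these values the original scheme creates a sixth block holding the 3 leftover rows, while the remainder-based assignment yields blocks of 67, 67, 67, 66 and 66 rows.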

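Source

2011-09-14 08:47:22 Delphine

There is no need to create separate data.frames/matrices; all you need to keep is an integer vector, id, that stores the shuffled fold index for each row.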

X <- read.csv('data.csv') 

k = 5 # number of folds 
fold_size <-nrow(X)/k 
indices <- rep(1:k,rep(fold_size,k)) 
id <- sample(indices, replace = FALSE) # random draws without replacement 

log_models <- new.env(hash=T, parent=emptyenv()) 
for (i in 1:k){ 
    train <- X[id != i,] 
    test <- X[id == i,] 
    # run algorithm, e.g. logistic regression 
    log_models[[as.character(i)]] <- glm(outcome~., family="binomial", data=train) 
}

Source

2014-08-27 11:35:33 Rhubarb

Note that when nrow(X) is not a multiple of k, some samples are discarded. – Samuel
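One way to avoid the issue raised in this comment is to build the fold labels with rep(..., length.out = nrow(X)), as in an earlier answer. The sketch below assumes a simulated data frame with 103 rows and a binary column named outcome; the data, the seed and the accuracy measure are hypothetical and only serve to make the example runnable.

```r
set.seed(123)                                   # arbitrary seed
# Hypothetical data: 103 rows, so the row count is not a multiple of k
X <- data.frame(outcome = rbinom(103, 1, 0.5),
                x1 = rnorm(103),
                x2 = rnorm(103))
k <- 5                                          # number of folds

# One fold label per row; length.out guarantees that no row is left unlabeled
id <- sample(rep(seq_len(k), length.out = nrow(X)))
fold_sizes <- as.integer(table(id))

# Fit a logistic regression on each training set and score it on the held-out fold
accuracy <- sapply(seq_len(k), function(i) {
    train <- X[id != i, ]
    test  <- X[id == i, ]
    fit <- glm(outcome ~ ., family = binomial, data = train)
    pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
    mean(pred == test$outcome)
})
```

Here the folds have 21, 21, 21, 20 and 20 rows, so every observation is used for testing exactly once.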

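The sperrorest package provides this capability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the GitHub version.

Example: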

library(sperrorest) 
data(ecuador) 

## non-spatial cross-validation: 
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1) 

# first repetition, second fold, test set indices: 
idx <- resamp[['1']][[2]]$test 

# test sample used in this particular repetition and fold: 
ecuador[idx , ]

If you have a spatial dataset (with coords), you can also visualize the folds you generated:

# this may take some time... 
plot(resamp, ecuador)

Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).

Source

2016-12-23 21:14:22
