2017-10-05 53 views

How to create balanced sets in R based on multiple variables

I have a large dataset that needs to be split into multiple balanced sets.

The data looks like the following:

> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8) 
> colnames(data)<-c("A","B","C","D","E","F","G","H") 

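The sets, each containing for example 20 rows, would need to be balanced on multiple variables, so that each subset ends up with similar means of B, C, and D compared with all other subsets.

Is there a way to do this in R? Any advice would be greatly appreciated. Thank you in advance!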

Answer

library(tidyverse) 

# Reproducible data 
set.seed(2) 
data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8) 
colnames(data)<-c("A","B","C","D","E","F","G","H") 

data=as.data.frame(data) 

Updated answer

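If you want to keep the observations from a given row together, it may not be possible to get similar means across sets within every column. With 8 columns (as in your sample data), you would need 25 sets of 20 rows in which every set has the same mean for column A, every set has the same mean for column B, and so on. That is a lot of constraints. However, there may be algorithms that can find an assignment of rows to sets that minimizes the differences in set means.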
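As a rough illustration of the kind of algorithm mentioned above, here is a minimal sketch of a random pairwise-swap local search that keeps rows intact. It is not taken from the answer: the choice of heuristic, the cost function (sum of the variances of the set means), and the names `balance_cols`, `set_cost`, and the iteration count are all assumptions made for this example. It uses base R only and balances just B, C, and D, as in the question.

```r
# Sketch only: keep rows intact and reduce differences in set means of B, C, D
set.seed(2)
data <- as.data.frame(matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8))
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")

balance_cols <- c("B", "C", "D")  # columns to balance (assumed from the question)
set_size <- 20
n_sets <- nrow(data) / set_size   # 25; assumes nrow(data) is divisible by set_size

x <- as.matrix(data[, balance_cols])

# Cost: for each balanced column, the variance of the set means; summed over columns
set_cost <- function(x, set) {
  set_means <- rowsum(x, set) / set_size  # all sets have exactly set_size rows
  sum(apply(set_means, 2, var))
}

set <- sample(rep(seq_len(n_sets), each = set_size))  # random starting assignment
start_cost <- set_cost(x, set)
cost <- start_cost

for (i in seq_len(5000)) {
  pair <- sample(nrow(x), 2)               # pick two rows at random
  if (set[pair[1]] == set[pair[2]]) next   # same set: swapping changes nothing
  candidate <- set
  candidate[pair] <- set[rev(pair)]        # swap the set labels of the two rows
  new_cost <- set_cost(x, candidate)
  if (new_cost < cost) {                   # keep the swap only if it improves balance
    set <- candidate
    cost <- new_cost
  }
}

data$set <- set

# Set means of the balanced columns after the search
aggregate(data[, balance_cols], by = list(set = data$set), FUN = mean)
```

Because swaps only exchange labels between two rows, every set keeps exactly 20 rows. The search is greedy and may stop at a local optimum, so the result is an approximation rather than a guaranteed best assignment.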

However, if you can take 20 observations from each column separately, without regard to which row they came from, then here is one option:

# Group into sets with same means 
same_means = data %>% 
    gather(key, value) %>% 
    arrange(value) %>% 
    group_by(key) %>% 
    mutate(set = c(rep(1:25, 10), rep(25:1, 10))) 

# Check means by set for each column 
same_means %>% 
    group_by(key, set) %>% 
    summarise(mean=mean(value)) %>% 
    spread(key, mean) %>% as.data.frame 
set  A  B  C  D  E  F  G  H 
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926 
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850 
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192 
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789 
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532 
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757 
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317 
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262 
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587 
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263 
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362 
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733 
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425 
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608 
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968 
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474 
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162 
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829 
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192 
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665 
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728 
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290 
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836 
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182 
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630 
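To see why the assignment pattern above gives such similar means, the following sketch applies the same pattern to a single column in base R and compares it with a random assignment. The variable names (`values`, `set_means`, `random_means`) are chosen for this illustration only.

```r
# Sketch: the mirrored assignment pattern applied to one sorted column
set.seed(2)
values <- sort(runif(500, min = 0, max = 10))

# Lower half of the sorted values gets sets 1..25 repeatedly,
# upper half gets 25..1, so each set's slight surplus in one half
# is offset by a slight deficit in the other
set <- c(rep(1:25, 10), rep(25:1, 10))

set_means <- tapply(values, set, mean)
diff(range(set_means))      # spread of set means with the mirrored pattern

# Same set sizes, but labels shuffled at random for comparison
random_means <- tapply(values, sample(set), mean)
diff(range(random_means))   # spread of set means with random assignment
```

Each set receives 10 values from the lower half and 10 from the upper half, which is why every set ends up with 20 values and a mean close to the overall mean.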

Original answer

# Randomly group data into 20-row groups 
set.seed(104) 
data = data %>% 
    mutate(set = sample(rep(1:(500/20), each=20))) 

head(data) 
  A  B   C  D  E   F  G   H set 
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10 
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20 
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19 
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25 
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19 
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5 
# Mean by set for each column 
data %>% group_by(set) %>% 
    summarise_all(mean) 
 set  A  B  C  D  E  F  G  H 
1  1 5.240236 6.143941 4.638874 5.367626 4.982008 4.20.521844 5.083868 
2  2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491 
3  3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547 
4  4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672 
5  5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151 
6  6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648 
7  7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660 
8  8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198 
9  9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806 
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670 
# ... with 15 more rows 

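If the total number of rows in the data frame is not divisible by the number of rows you want in each set, you can do the following when creating the sets: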

data = data %>% 
    mutate(set = sample(rep(1:ceiling(500/20), each=20))[1:n()]) 

In this case, the set sizes will vary a bit, because the number of data rows is not divisible by the desired number of rows per set.
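A small base R sketch of the same idea, using a hypothetical row count of 510 (not from the original data) to show how the set sizes come out when the division is not exact:

```r
# Sketch: set labels for a row count that is not a multiple of the set size
set.seed(104)
n_rows <- 510     # hypothetical row count, not divisible by 20
set_size <- 20
n_sets <- ceiling(n_rows / set_size)  # 26 sets

# Build 26 * 20 = 520 labels, shuffle them, and keep the first 510
set <- sample(rep(seq_len(n_sets), each = set_size))[seq_len(n_rows)]

sizes <- table(set)
range(sizes)      # smallest and largest set size
```

The 10 discarded labels are chosen at random, so which sets end up short, and by how much, changes with the seed.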


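Thanks, but doesn't this code randomly select 20 rows for each group? I want to select 20 rows for each group so that the mean of the column "B" values stays the same across all groups. –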


Do you want to select by row, or can you select 20 values separately for each column? – eipi10


No, each row holds the values for a single item, and I am trying to group these items into balanced sets using certain columns. –
