2012-05-31 107 views
2

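Splitting a dataset into multiple datasets with random columns in R

I have a large dataset. I want to split it into "n" datasets, each of size "s". If the total is not evenly divisible by the size, the last dataset may be smaller than the others. The resulting datasets should then be written to the working directory as csv files.

Say, the following small example: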

set.seed(1234) 
mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13)) 
mydf 

   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1   3  7  1  9  6  4  7  5  8   2   2   2   8
2   5  3  4  6  9  5  3 10  5   8  10   2  10
3   4  6 10  4  4  6  3  4  2   9   9   2   9
4  10 10  9  4  3  7  7  7 10   6   7  10   2
5  10  3  9  3  2 10  9  6  4   4   4   6   3
6   7  2  8  7  5  5 10 10  9   3   7   8   4
7   3  2  2  7 10  9  2  2 10   1   1  10   4
8   3  9  9  7  3  1  7  6 10   3  10   3   2
9   9  3  6  9  3  2  2  3  4   2   9  10  10
10  6  4  3  3  5  9  3  9 10   7   4   6  10

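I would like to create a function that randomly splits the dataset into n subsets (in this case of size 3; since there are 13 columns, the last dataset will have 1 column and the other 4 will have 3 each) and writes each subset out as a separate text file.

Here is what I have done: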

set.seed(123) 
reshuffled <- sample(1:length(mydf),length(mydf), replace = FALSE) 
# just crazy manual divide 
group1 <- reshuffled[1:3]; group2 <- reshuffled[4:6]; group3 <- reshuffled[7:9] 
group4 <- reshuffled[10:12]; group5 <- reshuffled[13] 

# just manual 
data1 <- mydf[,group1]; data2 <- mydf[,group2]; ....so on; 
# I want to write the dimensions of the dataset in the first row of each dataset
cat (dim(data1)) 
write.csv(data1, "data1.csv"); write.csv(data2, "data2.csv"); .....so on 

Is it possible to loop this process, since I have to generate 100 sub-datasets?

Answers

1

There may be a cleaner and simpler solution, but you could try the following:

mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13)) 

## Number of columns for each sub-dataset 
size <- 3 

nb.cols <- ncol(mydf) 
nb.groups <- nb.cols %/% size 
reshuffled <- sample.int(nb.cols, replace=FALSE) 
groups <- c(rep(1:nb.groups, each=size), rep(nb.groups+1, nb.cols %% size)) 
dfs <- lapply(split(reshuffled, groups), function(v) mydf[,v,drop=FALSE]) 

for (i in 1:length(dfs)) write.csv(dfs[[i]], file=paste("data",i,".csv",sep="")) 
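
The approach above can also be wrapped in a function, which makes it easier to repeat for many datasets. The following is only a sketch: the helper names `split_columns` and `write_with_dim` are hypothetical, the "rows columns" header line is one possible reading of the request to put the dimensions in the first row, and the files go to `tempdir()` rather than the working directory.

```r
# Hypothetical helper (not from the answer above): randomly split the columns
# of `df` into groups of `size` columns; the last group holds the remainder.
split_columns <- function(df, size) {
  shuffled <- sample.int(ncol(df))
  # ceiling(position / size) gives the group labels 1,1,1,2,2,2,...
  groups <- ceiling(seq_along(shuffled) / size)
  lapply(split(shuffled, groups), function(v) df[, v, drop = FALSE])
}

# Hypothetical helper: write "nrow ncol" as the first line, then the csv body.
write_with_dim <- function(d, path) {
  con <- file(path, open = "wt")
  on.exit(close(con))
  writeLines(paste(dim(d), collapse = " "), con)
  write.csv(d, con, row.names = FALSE)
}

set.seed(123)
mydf <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
parts <- split_columns(mydf, size = 3)

out_dir <- tempdir()  # assumption: a temporary directory instead of getwd()
for (i in seq_along(parts)) {
  write_with_dim(parts[[i]], file.path(out_dir, paste0("data", i, ".csv")))
}
```

Note that a file with an extra first line is no longer a plain csv; reading it back would need something like `read.csv(path, skip = 1)`.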
1

Just for fun, and probably slower than juba's solution:

mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13)) 
size <- 3 
by(t(mydf), 
    INDICES=sample(as.numeric(gl((ncol(mydf) %/% size) + 1, size, ncol(mydf))), 
        ncol(mydf), 
        replace=FALSE), 
    FUN=function(x) write.csv(t(x), paste(rownames(x), collapse='-'), row.names=F)) 
0

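To divide 'mydf' into n almost equal parts, I took inspiration from this question and its answers: link

It creates partition sizes such that the difference between the smallest and the largest partition is as small as possible. In this example that difference equals 1. Example:

Partition method 1 - using the 'floor' function (no reproducible code shown here). Divide 100 rows into 7 almost equal parts/summands by successively sampling floor(100/7) = 14 indices for the first 6 iterations. The seventh element is the remainder. This yields:

14, 14, 14, 14, 14, 14, 16; sum = 100, max difference = 2

Partition method 2 - using the 'ceiling' function (no reproducible code shown here). Using the 'ceiling' function instead of the 'floor' function gives a similar result, with ceiling(100/7) = 15:

15, 15, 15, 15, 15, 15, 10; sum = 100, max difference = 5

Partition method 3 - using the formula from the reference above. With the procedure below for the partition sizes, the vector ('sequence_diff') is:

14, 14, 14, 15, 14, 14, 15; sum = 100, max difference = 1

R code: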
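
Since methods 1 and 2 are described without code, here is a small sketch of how the three sets of partition sizes could be computed and compared. The variable names (`n_rows`, `n_parts`, `size_floor`, `size_ceiling`, `size_even`, `max_diff`) are made up for illustration and do not appear in the answer's own code.

```r
n_rows  <- 100  # number of rows to distribute
n_parts <- 7    # number of partitions

# Method 1: floor(n/k) for the first k-1 parts, the remainder in the last one
size_floor <- c(rep(floor(n_rows / n_parts), n_parts - 1),
                n_rows - (n_parts - 1) * floor(n_rows / n_parts))

# Method 2: same idea with ceiling(n/k)
size_ceiling <- c(rep(ceiling(n_rows / n_parts), n_parts - 1),
                  n_rows - (n_parts - 1) * ceiling(n_rows / n_parts))

# Method 3: differences of the cumulative cut points floor(n * i / k)
size_even <- diff(floor(n_rows * (0:n_parts) / n_parts))

# Gap between the largest and the smallest partition
max_diff <- function(s) max(s) - min(s)

print(rbind(floor = size_floor, ceiling = size_ceiling, even = size_even))
print(c(floor   = max_diff(size_floor),
        ceiling = max_diff(size_ceiling),
        even    = max_diff(size_even)))
```

Method 3 spreads the remainder across the partitions instead of pushing all of it into the last one, which is why its sizes never differ by more than 1.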

set.seed(1234) 
#I increased the number of rows in the data frame to 100 
mydf <- data.frame (matrix(sample(x = 1:100, size = 1300, replace = TRUE), 
        ncol = 13)) 

index_list  <- list()  #Will store the indices for all partitions 
indices   <- 1:nrow(mydf) #Initially contains all indices for the dataset 'mydf' 
numb_partitions <- 7   #Specifies the number of partitions 

sequence <- floor(((nrow(mydf)*1:numb_partitions)/numb_partitions)) 
sequence <- c(0, sequence) 

#'sequence_diff' will contain the number of instances for each partition. 
sequence_diff <- vector() 
for(j in 1:numb_partitions){ 
    sequence_diff[j] <- sequence[j+1] - sequence[j] 
} 

#Inspect 'sequence_diff' and verify its elements sum up to the total 
#number of rows in 'mydf' (100). 
sequence_diff      #[1] 14 14 14 15 14 14 15 
sum(sequence_diff) #[1] 100 (correct) 

for(i in 1:numb_partitions){ 

    #Use a different seed for each sampling iteration. 
    set.seed(seed = i) 

    #Sample from object 'indices' of size 1/'numb_partitions' 
    indices_partition <- sample(x = indices, 
           size = sequence_diff[i], 
           replace = FALSE) 

    #Remove the selected indices from 'indices' so these indices will not be 
    #selected in successive iterations. 
    indices   <- setdiff(x = indices, y = indices_partition) 

    #Store the indices for the i-th iteration in the list 'index_list'. This 
    #is just to verify later that the procedure has divided all indices 
    #into 'numb_partitions' disjoint sets. 
    index_list[[i]] <- indices_partition 

    #Dynamically create a new object that is named 'mydfx' in which x is the 
    #i-th partition. 
    assign(x = paste0("mydf", i), value = mydf[indices_partition,]) 

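    #write.csv() always uses "," as separator and always writes the column 
    #names; passing 'sep' or 'col.names' is ignored with a warning. 
    write.csv(x = get(x = paste0("mydf", i)), #Dynamically get the object from the environment. 
      file = paste0("mydf", i, ".csv"),       #Dynamically assign a name to the csv file. 
      row.names = FALSE) 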
} 

#Check whether all index subsets are mutually exclusive: their union should 
#have 100 unique elements. 
length(unique(unlist(index_list))) #[1] 100 (correct)
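
As a possible alternative to the loop with `assign()` and `get()`, the same partition can be drawn in a single shuffle and kept in a list. This is a sketch rather than a drop-in replacement: it uses one seed instead of one seed per iteration, so the row assignment differs from the loop above, and it writes to `tempdir()` (an assumption made here) instead of the working directory. The names `sizes`, `labels`, `parts` and `out_dir` are illustrative.

```r
set.seed(1234)
mydf <- data.frame(matrix(sample(x = 1:100, size = 1300, replace = TRUE),
                          ncol = 13))
numb_partitions <- 7

# Partition sizes via the same floor-based cut points as 'sequence_diff'
sizes <- diff(floor(nrow(mydf) * (0:numb_partitions) / numb_partitions))

# One label per row position: 1 repeated sizes[1] times, 2 repeated sizes[2] times, ...
labels <- rep(seq_len(numb_partitions), times = sizes)

# Shuffle all row indices once, then cut the shuffled vector by label
index_list <- split(sample.int(nrow(mydf)), labels)
parts <- lapply(index_list, function(idx) mydf[idx, , drop = FALSE])

out_dir <- tempdir()  # assumption: a temporary directory instead of getwd()
for (i in seq_along(parts)) {
  write.csv(parts[[i]],
            file = file.path(out_dir, paste0("mydf", i, ".csv")),
            row.names = FALSE)
}
```

Keeping the partitions in the list `parts` avoids creating seven separately named objects, and the disjointness check `length(unique(unlist(index_list)))` works unchanged.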