2011-06-09 55 views
1

Is there a way to subset a large data frame like this one?

vec <- data.frame(Names = c("var1","var2","var3","var4","var5","var6","var7", 
          "var8","var9","var10","var11","var12","var13", 
          "var14") , 
        phase1= runif(14), 
        phase1.away= runif(14), 
        phase1_in= runif(14), 
        phase1_out= runif(14), 
        phase1.1= runif(14), 
        phase1.away.1= runif(14), 
        phase1_in.1= runif(14), 
        phase1_out.1= runif(14), 
        phase1.2= runif(14), 
        phase1.away.2= runif(14), 
        phase1_in.2= runif(14), 
        phase1_out.2= runif(14)) 

I want a quick and clever way to build a new data frame from it such that:

- it is always ordered by phase1.x and gives the variable names together with the corresponding phase1_in and phase1_out values, but without phase1.away.

What I am doing at the moment is basically:

vec.o<-vec[with(vec, order(-phase1)),] 
d1<-vec.o[c("Names","phase1","phase1_in","phase1_out")] 

vec.o<-vec[with(vec, order(-phase1.1)),] 
d2<-vec.o[c("Names","phase1.1","phase1_in.1","phase1_out.1")] 

cbind(d1,d2) 

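This is very tedious, and I believe it is also rather un-R-ish. Any clever ideas? I work with large data frames all the time, and R seems a bit cumbersome for this. Is there any good literature you can recommend for these purposes (loading many variables, creating names for them, manipulating those variables, etc.)?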

Answers

3

Edit: corrected for the case where phase.x goes up to 10 and higher.

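I assume you have quite a few more columns than just phase1.1 and phase1.2, so a general solution using regular expressions would be something along these lines: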

# Make an id vector for the phase1.x, and give Names id -1 
# gives a warning as character is transformed to NA 
id <- as.numeric(gsub(".*\\.(\\d+$)","\\1",names(vec))) 
id[1] <- -1 
id[is.na(id)] <- 0 # first occurrence, no .x 


veclist <- lapply(unique(id)[-1],function(i){ 
    #select all variables necessary, exclude the away 
    out <- vec[id %in% c(i,-1) & !grepl("away",names(vec))] 
    # find the phase1.x for ordering 
    ovec <- grepl("phase1(\\.\\d+)?$",names(out)) 
    # order descending, as in the question, and produce 
    out[order(-out[,ovec]),] 
}) 

do.call(cbind,veclist) 

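It works by recognizing a trailing number that is preceded by a dot and extracting it. If a column name has no trailing number preceded by a dot, it is either the Names variable (which I mark with -1) or the first phase (which I mark with 0).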
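To make the id step concrete, here is a minimal sketch on a handful of column names. The names in nms are made up for illustration (including a two-digit suffix), and suppressWarnings is only there to hide the coercion warning mentioned in the code comments above.

```r
# Hypothetical column names, chosen only to illustrate the pattern
nms <- c("Names", "phase1", "phase1.away", "phase1_in",
         "phase1.1", "phase1.away.1", "phase1_out.12")

# Keep only a trailing ".<digits>" suffix; names without one come back unchanged
suffix <- gsub(".*\\.(\\d+$)", "\\1", nms)

# Non-numeric leftovers become NA; the coercion warning is suppressed here
id <- suppressWarnings(as.numeric(suffix))
id[1] <- -1          # the Names column
id[is.na(id)] <- 0   # first phase, no .x suffix

id
# -1  0  0  0  1  1 12
```

Every column that shares an id belongs to the same phase, and the two-digit suffix is picked up in full.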

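Now you have an id vector that makes it easy to select the variables that belong together, so you can loop over the unique values of id, skipping the first one (which is -1). Using a regular expression again, you can pick whichever variables you want for the new data frame. The final do.call then combines all of these data frames.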

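By the way, ordering the sub data frame is much faster than first ordering the original data frame and then selecting the variables. That is where nullglob's solution gains its speed.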

Nice, although it looks like id comes out as zero when phase.x is phase 10 or higher – Alex 2011-06-09 12:25:24

@Alex: Good catch. I corrected it to include phase.x above 10. – 2011-06-09 12:48:55

1

This isn't particularly clever, but it is more than twice as fast (according to my simple benchmark):

o1 <- order(-vec$phase1) 
o2 <- order(-vec$phase1.1) 
cbind(vec[o1,c("Names","phase1","phase1_in","phase1_out")], 
     vec[o2,c("Names","phase1.1","phase1_in.1","phase1_out.1")]) 

Here is the benchmark:

> n <- 2e5 
> vec<-data.frame(Names = as.character(runif(n)) , 
+     phase1= runif(n), 
+     phase1.away= runif(n), 
+     phase1_in= runif(n), 
+     phase1_out= runif(n), 
+     phase1.1= runif(n), 
+     phase1.away.1= runif(n), 
+     phase1_in.1= runif(n), 
+     phase1_out.1= runif(n), 
+     phase1.2= runif(n), 
+     phase1.away.2= runif(n), 
+     phase1_in.2= runif(n), 
+     phase1_out.2= runif(n)) 
> 
> 
> test1 <- function(){ 
+ vec.o<-vec[with(vec, order(-phase1)),] 
+ d1<-vec.o[c("Names","phase1","phase1_in","phase1_out")] 
+ vec.o<-vec[with(vec, order(-phase1.1)),] 
+ d2<-vec.o[c("Names","phase1.1","phase1_in.1","phase1_out.1")] 
+ d3 <- cbind(d1,d2) 
+ } 
> system.time(test1()) 
    user system elapsed 
    1.764 0.048 1.811 
> 
> 
> test2 <- function(){ 
+ o1 <- order(-vec$phase1) 
+ o2 <- order(-vec$phase1.1) 
+ d4 <- cbind(vec[o1,c("Names","phase1","phase1_in","phase1_out")], 
+    vec[o2,c("Names","phase1.1","phase1_in.1","phase1_out.1")]) 
+ } 
> system.time(test2()) 
    user system elapsed 
    0.736 0.056 0.791 

Thanks, but I have a data frame with 260 phases, which is what concerns me most, since I want to avoid typing everything in by hand – Alex 2011-06-09 11:55:23

You don't need to use column names to select the columns; you can use column indices, which may be faster and easier to type. – nullglob 2011-06-09 12:24:40
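A rough sketch of the column-index idea, in case it helps with the 260-phase case. It assumes (this is not stated in the thread) that after Names the columns always come in blocks of four in the order phase, away, in, out. The names toy, n_phase, blocks and res are made up for this example.

```r
set.seed(1)  # only so the toy data is reproducible

# Toy data with the same layout as in the question: Names, then blocks of
# four columns (phase, away, in, out) per phase
toy <- data.frame(Names = paste0("var", 1:5),
                  phase1 = runif(5), phase1.away = runif(5),
                  phase1_in = runif(5), phase1_out = runif(5),
                  phase1.1 = runif(5), phase1.away.1 = runif(5),
                  phase1_in.1 = runif(5), phase1_out.1 = runif(5))

# Assumption: every phase occupies exactly four consecutive columns
n_phase <- (ncol(toy) - 1) / 4

blocks <- lapply(seq_len(n_phase), function(k) {
  first <- 2 + 4 * (k - 1)                   # index of the phase column in block k
  keep <- c(1, first, first + 2, first + 3)  # Names, phase, in, out (skip away)
  toy[order(-toy[[first]]), keep]            # order descending by the phase column
})

res <- do.call(cbind, blocks)
```

If the column order is not that regular, the regex-based id vector from the other answer is the safer route, since it does not depend on column positions.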