2014-02-20

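I have a data.frame with two variables and one factor column. I then compute a subset of this data and want to reorder the remaining factor levels. I found the solution below, but it is too slow on my real data. How can I reorder the factor levels of a subset of a data.frame and apply that order to the main data.frame?

Here is a step-by-step example: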

    library(plyr)
    library(ggplot2)
    # generate an example data.frame
    # x and y are integers, l is a factor
    df <- data.frame(x=rep(1:5, each=4), y=1:5, l=factor(letters[1:10]))
    df <- df[1:17,]
    df
       x y l
    1  1 1 a
    2  1 2 b
    3  1 3 c
    4  1 4 d
    5  2 5 e
    6  2 1 f
    7  2 2 g
    8  2 3 h
    9  3 4 i
    10 3 5 j
    11 3 1 a
    12 3 2 b
    13 4 3 c
    14 4 4 d
    15 4 5 e
    16 4 1 f
    17 5 2 g

Now I compute a temporary data.frame, which I will use to select a subset of df:

    # computing temporary data.frame
    df2 <- ddply(df, .(l), summarize, sum=sum(y))
    df2$pct <- df2$sum/sum(df2$sum) * 100
    df2
       l sum       pct
    1  a   2  4.166667
    2  b   4  8.333333
    3  c   6 12.500000
    4  d   8 16.666667
    5  e  10 20.833333
    6  f   2  4.166667
    7  g   4  8.333333
    8  h   3  6.250000
    9  i   4  8.333333
    10 j   5 10.416667
    # select only those letters with "high enough" y-value
    df2.selected <- df2[df2$pct > 10,]
    df2.selected
       l sum      pct
    3  c   6 12.50000
    4  d   8 16.66667
    5  e  10 20.83333
    10 j   5 10.41667
    # use only those letters which occur in df2.selected$l
    df.subset <- df[df$l %in% df2.selected$l,]
    df.subset
       x y l
    3  1 3 c
    4  1 4 d
    5  2 5 e
    10 3 5 j
    13 4 3 c
    14 4 4 d
    15 4 5 e

I get rid of the now-unused levels of my factor:

    # get rid of unused values of l
    df.subset$l <- factor(df.subset$l)
    str(df.subset)
    'data.frame':   7 obs. of  3 variables:
     $ x: int  1 1 2 3 4 4 4
     $ y: int  3 4 5 5 3 4 5
     $ l: Factor w/ 4 levels "c","d","e","j": 1 2 3 4 1 2 3

The new level order for my subset's factor should be the following (I need it for the facet_wrap below):

    # the new order of the factor variable should be the (inverse) order of sum
    df2.selected <- df2.selected[order(-df2.selected$sum),]
    df2.selected
       l sum      pct
    5  e  10 20.83333
    4  d   8 16.66667
    3  c   6 12.50000
    10 j   5 10.41667
    # that should be the new order of the factor variable l: e, d, c, j
    # get rid of unused values of l
    df2.selected$l <- factor(df2.selected$l)
    df2.selected
       l sum      pct
    5  e  10 20.83333
    4  d   8 16.66667
    3  c   6 12.50000
    10 j   5 10.41667
    str(df2.selected)
    'data.frame':   4 obs. of  3 variables:
     $ l  : Factor w/ 4 levels "c","d","e","j": 3 2 1 4
     $ sum: int  10 8 6 5
     $ pct: num  20.8 16.7 12.5 10.4


    # Here I need the order e, d, c, j!
    ggplot(data=df.subset, aes(x=x, y=y)) + geom_point() + facet_wrap(~l)
    # so I merged both -- This is the problem. It's too expensive. Is there a better way?
    df.merged <- merge(df.subset, df2.selected, by=c('l'))
    df.merged$l <- reorder(df.merged$l, -df.merged$sum)
    df.merged
      l x y sum      pct
    1 c 1 3   6 12.50000
    2 c 4 3   6 12.50000
    3 d 1 4   8 16.66667
    4 d 4 4   8 16.66667
    5 e 2 5  10 20.83333
    6 e 4 5  10 20.83333
    7 j 3 5   5 10.41667
    str(df.merged)
    'data.frame':   7 obs. of  5 variables:
     $ l  : Factor w/ 4 levels "e","d","c","j": 3 3 2 2 1 1 4
      ..- attr(*, "scores")= num [1:4(1d)] -6 -8 -10 -5
      .. ..- attr(*, "dimnames")=List of 1
      .. .. ..$ : chr  "c" "d" "e" "j"
     $ x  : int  1 4 1 4 2 4 3
     $ y  : int  3 3 4 4 5 5 5
     $ sum: int  6 6 8 8 10 10 5
     $ pct: num  12.5 12.5 16.7 16.7 20.8 ...
    ggplot(data=df.merged, aes(x=x, y=y)) + geom_point() + facet_wrap(~l)

Answers

Here is a data.table solution, which should be faster:

library(data.table)
dt <- data.table(df, key="l")
keep.lvls <- as.character(
  dt[, list(sum=sum(y)), by=l][,     # get the sums for each group
    pct:=sum/sum(sum) * 100][        # pct for each group
    pct > 10][                       # only keep those greater than 10
    order(pct, decreasing=TRUE), l]  # order by pct, pull out `l` only
)
str(dt.final <-
  dt[
    keep.lvls,][,                    # only keep `keep.lvls` from `dt`
    l:=factor(l, levels=keep.lvls)]) # reset factors on `dt` to have `keep.lvls` levels

This produces:

Classes ‘data.table’ and 'data.frame': 8 obs. of 3 variables: 
$ l: Factor w/ 4 levels "e","j","d","i": 1 1 2 2 3 3 4 4 
$ x: int 2 4 3 5 1 4 3 5 
$ y: int 5 5 5 5 4 4 4 4 
- attr(*, ".internal.selfref")=<externalptr> 

Note that these results differ slightly from yours because we have different random data. This is with set.seed(1).
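
Because the chained call is dense, the same steps can also be written one at a time. The following is only a sketch under some assumptions: the intermediate names `sums` and `selected` and the column name `total` (used instead of `sum` so the column does not share a name with the function) are not part of the answer, the example `df` from the question is rebuilt so the code runs on its own, and rows are filtered with `%in%` rather than with a keyed join.

```r
library(data.table)

# Rebuild the example data from the question so the sketch is self-contained.
df <- data.frame(x = rep(1:5, each = 4), y = 1:5, l = factor(letters[1:10]))
df <- df[1:17, ]
dt <- as.data.table(df)

# Step 1: sum of y for each level of l.
# `total` is an illustrative column name, not taken from the answer.
sums <- dt[, list(total = sum(y)), by = l]

# Step 2: share of each level in percent, added by reference.
sums[, pct := total / sum(total) * 100]

# Step 3: keep the levels above 10 percent, largest share first.
selected <- sums[pct > 10][order(-pct)]
keep.lvls <- as.character(selected$l)

# Step 4: subset the full table, then rebuild the factor so that its
# levels follow the order stored in keep.lvls.
dt.final <- dt[l %in% keep.lvls]
dt.final[, l := factor(l, levels = keep.lvls)]

levels(dt.final$l)
```

With the question's example data, the levels should come out as e, d, c, j, which is the order `facet_wrap(~l)` would then use for the panels.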

Thanks, this seems to do what it should. But I have to evaluate these commands step by step. :-) I have been focused on plyr so far. – JerryWho
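
For readers who want to stay with plain data.frames, the expensive `merge()` can probably be avoided, because `factor()` accepts the desired level order directly through its `levels` argument. The sketch below uses base R only and is not taken from the answer above; the names `sums`, `pct`, `sel` and `keep.lvls` are illustrative, and the question's example `df` is rebuilt so the code runs on its own.

```r
# Rebuild the example data from the question.
df <- data.frame(x = rep(1:5, each = 4), y = 1:5, l = factor(letters[1:10]))
df <- df[1:17, ]

# Sum of y per level of l; the result is named by the levels of l.
sums <- tapply(df$y, df$l, sum)
pct <- sums / sum(sums) * 100

# Keep the levels above 10 percent and order them by decreasing sum.
sel <- sums[pct > 10]
keep.lvls <- names(sel)[order(sel, decreasing = TRUE)]

# Subset the rows, then set the level order in one step, with no merge.
df.subset <- df[df$l %in% keep.lvls, ]
df.subset$l <- factor(df.subset$l, levels = keep.lvls)

levels(df.subset$l)
```

Passing `df.subset` to the question's `ggplot(...) + facet_wrap(~l)` call should then lay out the panels in that level order. Whether this is fast enough on the real data is untested here.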