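I have a data.frame with two variables and one factor column. I then compute a subset of this data and want to reorder the remaining factor levels. I found the solution below, but it gets slow with real data. So how can I reorder my factor? How do I reorder the factor of a data.frame subset and apply it to the main data.frame?
Here is a step-by-step example: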
library(plyr)
library(ggplot2)
# generate an example data.frame
# x and y are integers, l is a factor
df <- data.frame(x = rep(1:5, each = 4), y = 1:5, l = factor(letters[1:10]))
df <- df[1:17, ]
df
   x y l
1  1 1 a
2  1 2 b
3  1 3 c
4  1 4 d
5  2 5 e
6  2 1 f
7  2 2 g
8  2 3 h
9  3 4 i
10 3 5 j
11 3 1 a
12 3 2 b
13 4 3 c
14 4 4 d
15 4 5 e
16 4 1 f
17 5 2 g
Now I compute a temporary data.frame, which I will use to select a subset of df:
# computing temporary data.frame
df2 <- ddply(df, .(l), summarize, sum=sum(y))
df2$pct <- df2$sum/sum(df2$sum) * 100
df2
   l sum       pct
1  a   2  4.166667
2  b   4  8.333333
3  c   6 12.500000
4  d   8 16.666667
5  e  10 20.833333
6  f   2  4.166667
7  g   4  8.333333
8  h   3  6.250000
9  i   4  8.333333
10 j   5 10.416667
# select only those letters with "high enough" y-value
df2.selected <- df2[df2$pct > 10,]
df2.selected
   l sum      pct
3  c   6 12.50000
4  d   8 16.66667
5  e  10 20.83333
10 j   5 10.41667
# use only those letters which occur in df2.selected$l
df.subset <- df[df$l %in% df2.selected$l,]
df.subset
   x y l
3  1 3 c
4  1 4 d
5  2 5 e
10 3 5 j
13 4 3 c
14 4 4 d
15 4 5 e
I drop the now-unused levels of my factor:
# get rid of unused values of l
df.subset$l <- factor(df.subset$l)
str(df.subset)
'data.frame': 7 obs. of 3 variables:
$ x: int 1 1 2 3 4 4 4
$ y: int 3 4 5 5 3 4 5
$ l: Factor w/ 4 levels "c","d","e","j": 1 2 3 4 1 2 3
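As a side note, the same clean-up can be sketched with `droplevels()`, which drops unused levels from every factor column of a data.frame in one call. The block below rebuilds the example in base R and hard-codes the four selected letters as a shortcut; that shortcut is an assumption made only to keep the sketch self-contained.

```r
# Rebuild the example data from the question (base R only).
df <- data.frame(x = rep(1:5, each = 4), y = 1:5, l = factor(letters[1:10]))
df <- df[1:17, ]

# Shortcut (assumption): the selected letters are written out by hand here
# instead of being taken from df2.selected$l.
df.subset <- df[df$l %in% c("c", "d", "e", "j"), ]

# Subsetting rows does not touch the level set: all ten levels remain.
n.before <- nlevels(df.subset$l)

# droplevels() removes the unused levels from all factor columns at once.
df.subset <- droplevels(df.subset)
levels(df.subset$l)
```

For a single column this is equivalent to `factor(df.subset$l)`; it mainly saves typing when several factor columns are involved.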
The new order of my subset's factor should be the following (I need it for the facet_wrap below):
# the new order of the factor variable should be the (inverse) order of sum
df2.selected <- df2.selected[order(-df2.selected$sum),]
df2.selected
   l sum      pct
5  e  10 20.83333
4  d   8 16.66667
3  c   6 12.50000
10 j   5 10.41667
# that should be the new order of the factor variable l: e, d, c, j
# get rid of unused values of l
df2.selected$l <- factor(df2.selected$l)
df2.selected
   l sum      pct
5  e  10 20.83333
4  d   8 16.66667
3  c   6 12.50000
10 j   5 10.41667
str(df2.selected)
'data.frame': 4 obs. of 3 variables:
$ l : Factor w/ 4 levels "c","d","e","j": 3 2 1 4
$ sum: int 10 8 6 5
$ pct: num 20.8 16.7 12.5 10.4
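The str() output above still lists the levels as c, d, e, j because calling `factor()` on an existing factor keeps the relative order of its current levels and ignores the row order. A minimal sketch of setting the order explicitly through the `levels` argument follows; the small data.frame is a hand-written stand-in for the sorted df2.selected, and `ordered.levels` is a name introduced only for this illustration.

```r
# Hand-written stand-in (assumption) for df2.selected sorted by descending sum.
df2.selected <- data.frame(
  l   = factor(c("e", "d", "c", "j"), levels = letters[1:10]),
  sum = c(10, 8, 6, 5)
)

# Without a levels argument the old level order survives: c, d, e, j.
default.levels <- levels(factor(df2.selected$l))

# Taking the values in row order as the levels keeps the sorted-by-sum order.
ordered.levels <- as.character(df2.selected$l)
df2.selected$l <- factor(df2.selected$l, levels = ordered.levels)
levels(df2.selected$l)
```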
# Here I need the order e, d, c, j!
ggplot(data=df.subset, aes(x=x, y=y)) + geom_point() + facet_wrap(~l)
# So I merged both. This is the problem: it's too expensive. Is there a better way?
df.merged <- merge(df.subset, df2.selected, by=c('l'))
df.merged$l <- reorder(df.merged$l, -df.merged$sum)
df.merged
  l x y sum      pct
1 c 1 3   6 12.50000
2 c 4 3   6 12.50000
3 d 1 4   8 16.66667
4 d 4 4   8 16.66667
5 e 2 5  10 20.83333
6 e 4 5  10 20.83333
7 j 3 5   5 10.41667
str(df.merged)
'data.frame': 7 obs. of 5 variables:
$ l : Factor w/ 4 levels "e","d","c","j": 3 3 2 2 1 1 4
..- attr(*, "scores")= num [1:4(1d)] -6 -8 -10 -5
.. ..- attr(*, "dimnames")=List of 1
.. .. ..$ : chr "c" "d" "e" "j"
$ x : int 1 4 1 4 2 4 3
$ y : int 3 3 4 4 5 5 5
$ sum: int 6 6 8 8 10 10 5
$ pct: num 12.5 12.5 16.7 16.7 20.8 ...
ggplot(data=df.merged, aes(x=x, y=y)) + geom_point() + facet_wrap(~l)
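One possible way to avoid the merge, offered as a sketch rather than a definitive answer: compute the desired level order once from the per-level sums and hand it to `factor()` through the `levels` argument. The block uses base R (`split()` and `sapply()`) in place of the ddply() step so that it runs without extra packages; the names `sums`, `pct` and `keep` are introduced here for illustration only.

```r
# Rebuild the example data from the question (base R only).
df <- data.frame(x = rep(1:5, each = 4), y = 1:5, l = factor(letters[1:10]))
df <- df[1:17, ]

# Per-level sums of y as a named vector (stand-in for the ddply() summary).
sums <- sapply(split(df$y, df$l), sum)
pct  <- sums / sum(sums) * 100

# Levels above the 10% threshold, ordered by descending sum.
keep <- names(sort(sums[pct > 10], decreasing = TRUE))

# Subset the rows, then set the level order directly; no merge() is needed.
df.subset <- df[df$l %in% keep, ]
df.subset$l <- factor(df.subset$l, levels = keep)
levels(df.subset$l)
```

Since facet_wrap lays out panels in level order, `ggplot(data=df.subset, aes(x=x, y=y)) + geom_point() + facet_wrap(~l)` should then show the panels as e, d, c, j. The design choice is that only a short character vector of levels is passed around, so the cost of joining the summary back onto every row is avoided.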
Thanks, this seems to do what it should. But I have to evaluate these commands step by step. :-) I have been focused on plyr. – JerryWho