2013-02-21 34 views

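Get the mean of values that match specific criteria (pattern matching)

I asked a related question earlier and the answer solved it for me. I have a data frame that looks like this: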

id                 weekdays                 halflife
241732222300860000 Friday, Aug 31, 2012, 22 0.4166666667
24168917           Friday, Aug 31, 2012, 19 0.3833333333
241686878137512000 Friday, Aug 31, 2012, 19 0.4
241651117396738000 Friday, Aug 31, 2012, 16 1.5666666667
241635163505820000 Friday, Aug 31, 2012, 15 0.95
241633401382265000 Friday, Aug 31, 2012, 15 2.3666666667

I want to get the mean half-life of the items created on Mondays, then on Tuesdays, and so on. (My date range spans 6 months.)

To get the date values I used strptime and difftime. Also, I found the maximum half-life with max(df$halflife); how can I find the id that corresponds to it?

Reproducible code:

structure(list(id = c(241732222300860416, 24168917, 
241686878137511936, 241651117396738048, 241635163505819648, 241633401382264832 
), weekdays = c("Friday, Aug 31, 2012, 22", "Friday, Aug 31, 2012, 19", 
"Friday, Aug 31, 2012, 19", "Friday, Aug 31, 2012, 16", "Friday, Aug 31, 2012, 15", 
"Friday, Aug 31, 2012, 15"), halflife = structure(c(0.416666666666667, 
0.383333333333333, 0.4, 1.56666666666667, 0.95, 2.36666666666667 
), class = "difftime", units = "mins")), .Names = c("id", 
"weekdays", "halflife"), row.names = c(NA, 6L), class = "data.frame") 

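So now I have the mean half-life for all Mondays, Tuesdays, and so on. How can I get the mean for each hour within those weekdays, i.e. the mean half-life of all items created on Mondays at 9 am, 10 am, then 11 am, then on Tuesdays at 9 am, 10 am, 11 am, and so on? The dates in the weekdays column are formatted so that the last number after the comma is the hour in which the item was created. I am very bad at regular expressions and pattern matching, which is why I am asking this follow-up question.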

Answer


With the base packages alone, you can do the following.

> mydf 
      id     weekdays  halflife 
1 2.417322e+17 Friday, Aug 31, 2012, 22 0.4166667 mins 
2 2.416892e+17 Friday, Aug 31, 2012, 19 0.3833333 mins 
3 2.416869e+17 Friday, Aug 31, 2012, 19 0.4000000 mins 
4 2.416511e+17 Friday, Aug 31, 2012, 16 1.5666667 mins 
5 2.416352e+17 Friday, Aug 31, 2012, 15 0.9500000 mins 
6 2.416334e+17 Friday, Aug 31, 2012, 15 2.3666667 mins 

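Instead of using a regular expression, we can simply apply strsplit to each element of weekdays, unlist the result, shape it back into a 4-column matrix, and cbind that matrix onto mydf.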

> mydf2 <- cbind(mydf, matrix(unlist(sapply(mydf$weekdays, strsplit, split=', ')), byrow=TRUE, ncol=4, dimnames=list(1:nrow(mydf), c('Weekday', 'Day', 'Year', 'Hour'))))
> mydf2 
      id     weekdays  halflife Weekday  Day Year Hour 
1 2.417322e+17 Friday, Aug 31, 2012, 22 0.4166667 mins Friday Aug 31 2012 22 
2 2.416892e+17 Friday, Aug 31, 2012, 19 0.3833333 mins Friday Aug 31 2012 19 
3 2.416869e+17 Friday, Aug 31, 2012, 19 0.4000000 mins Friday Aug 31 2012 19 
4 2.416511e+17 Friday, Aug 31, 2012, 16 1.5666667 mins Friday Aug 31 2012 16 
5 2.416352e+17 Friday, Aug 31, 2012, 15 0.9500000 mins Friday Aug 31 2012 15 
6 2.416334e+17 Friday, Aug 31, 2012, 15 2.3666667 mins Friday Aug 31 2012 15 
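The same split can also be written with do.call(rbind, ...), which some readers find easier to follow than the unlist/matrix round trip. This is only a sketch: the data frame below is a stand-in rebuilt from the weekdays strings in the question, with halflife kept as plain numeric minutes (an assumption made for brevity). It splits on the literal ", " so that no piece carries a leading space, and converts Hour to an integer.

```r
# Stand-in for mydf (assumption: halflife as plain numeric minutes)
mydf <- data.frame(
  weekdays = c("Friday, Aug 31, 2012, 22", "Friday, Aug 31, 2012, 19",
               "Friday, Aug 31, 2012, 19", "Friday, Aug 31, 2012, 16",
               "Friday, Aug 31, 2012, 15", "Friday, Aug 31, 2012, 15"),
  halflife = c(0.416666666666667, 0.383333333333333, 0.4,
               1.56666666666667, 0.95, 2.36666666666667),
  stringsAsFactors = FALSE
)

# strsplit returns one character vector of 4 pieces per row;
# rbind stacks them into a 6 x 4 character matrix
parts <- do.call(rbind, strsplit(mydf$weekdays, ", ", fixed = TRUE))
colnames(parts) <- c("Weekday", "Day", "Year", "Hour")

mydf2 <- cbind(mydf, as.data.frame(parts, stringsAsFactors = FALSE))
mydf2$Hour <- as.integer(mydf2$Hour)  # so hours sort numerically
```

Keeping Hour as an integer means that a single-digit hour such as 9 sorts before 10 in later summaries, which would not be the case for character values.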

Now that the weekdays column has been split properly, we can use the aggregate function to compute the mean over the grouping columns we need.

> aggregate(halflife ~ Weekday, data=mydf2, FUN = mean) 
    Weekday halflife 
1 Friday 1.013889 

If you want to group by Weekday as well as Hour, then:

> aggregate(halflife ~ Weekday + Hour, data=mydf2, FUN = mean) 
    Weekday Hour halflife 
1 Friday 15 1.6583333 
2 Friday 16 1.5666667 
3 Friday 19 0.3916667 
4 Friday 22 0.4166667 

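As used here, the first argument of the aggregate function is a formula object, which supports one ~ one, one ~ many, many ~ one, and many ~ many relationships. See the examples in ?aggregate to learn how to use it.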
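The four formula shapes can be sketched side by side on a toy data frame. Everything here (toy, g1, g2, v1, v2) is hypothetical and exists only to show the syntax: the left-hand side holds the variables being summarised, the right-hand side the grouping variables.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "y", "x", "y"),
  v1 = c(1, 2, 3, 4),
  v2 = c(10, 20, 30, 40)
)

one_one   <- aggregate(v1 ~ g1, data = toy, FUN = mean)                  # one ~ one
one_many  <- aggregate(v1 ~ g1 + g2, data = toy, FUN = mean)             # one ~ many
many_one  <- aggregate(cbind(v1, v2) ~ g1, data = toy, FUN = mean)       # many ~ one
many_many <- aggregate(cbind(v1, v2) ~ g1 + g2, data = toy, FUN = mean)  # many ~ many
```

Several response variables are combined with cbind() on the left, and several grouping variables are joined with + on the right.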

Here is an example of the many ~ many relationship.

> set.seed(12345) 
> mydf2 <- cbind(mydf2, newvar = rnorm(nrow(mydf2))) 
> mydf2 
      id     weekdays  halflife Weekday  Day Year Hour  newvar 
1 2.417322e+17 Friday, Aug 31, 2012, 22 0.4166667 mins Friday Aug 31 2012 22 0.5855288 
2 2.416892e+17 Friday, Aug 31, 2012, 19 0.3833333 mins Friday Aug 31 2012 19 0.7094660 
3 2.416869e+17 Friday, Aug 31, 2012, 19 0.4000000 mins Friday Aug 31 2012 19 -0.1093033 
4 2.416511e+17 Friday, Aug 31, 2012, 16 1.5666667 mins Friday Aug 31 2012 16 -0.4534972 
5 2.416352e+17 Friday, Aug 31, 2012, 15 0.9500000 mins Friday Aug 31 2012 15 0.6058875 
6 2.416334e+17 Friday, Aug 31, 2012, 15 2.3666667 mins Friday Aug 31 2012 15 -1.8179560 
> aggregate(cbind(newvar,halflife) ~ Weekday + Hour, data=mydf2, FUN = mean) 
    Weekday Hour  newvar halflife 
1 Friday 15 -0.6060343 1.6583333 
2 Friday 16 -0.4534972 1.5666667 
3 Friday 19 0.3000814 0.3916667 
4 Friday 22 0.5855288 0.4166667
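The question also asked which id belongs to the maximum half-life. One possible approach, offered as a sketch rather than part of the answer above, is which.max(), which returns the row index of the first maximum. The data frame df below is rebuilt from the question's dput output, keeping only the id and halflife columns.

```r
df <- data.frame(
  id = c(241732222300860416, 24168917, 241686878137511936,
         241651117396738048, 241635163505819648, 241633401382264832),
  halflife = as.difftime(c(0.416666666666667, 0.383333333333333, 0.4,
                           1.56666666666667, 0.95, 2.36666666666667),
                         units = "mins")
)

# Row index of the largest half-life (first one if there are ties)
i <- which.max(as.numeric(df$halflife))

df[i, ]                               # the whole matching row
format(df$id[i], scientific = FALSE)  # id printed without exponent notation

# All rows that share the maximum, in case of ties
ties <- df[df$halflife == max(df$halflife), ]
```

Note that ids of this size are stored as doubles, so values above 2^53 may not be represented exactly; if the exact digits matter, reading the id column as character is a safer choice.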