2016-02-05 75 views
6

spread vs dcast

I have a table like this:

> head(dt2)
  Weight Height   Fitted interval limit    value
1   65.6  174.0 71.91200     pred   lwr 53.73165
2   80.7  193.5 91.63237     pred   lwr 73.33198
3   72.6  186.5 84.55326     pred   lwr 66.31751
4   78.8  187.2 85.26117     pred   lwr 67.02004
5   74.8  181.5 79.49675     pred   lwr 61.29244
6   86.4  184.0 82.02501     pred   lwr 63.80652

and I want to reshape it to look like this:

> head(reshape2::dcast(dt2, 
     Weight + Height + Fitted + interval ~ limit, 
     fun.aggregate = mean)) 
  Weight Height   Fitted interval      lwr      upr
1   42.0  153.4 51.07920     conf 49.15463 53.00376
2   42.0  153.4 51.07920     pred 32.82122 69.33717
3   43.2  160.0 57.75378     conf 56.35240 59.15516
4   43.2  160.0 57.75378     pred 39.54352 75.96404
5   44.8  149.5 47.13512     conf 44.87642 49.39382
6   44.8  149.5 47.13512     pred 28.83891 65.43133

How can I do the same thing with tidyr::spread?

I tried

> tidyr::spread(dt2, limit, value) 

but got this error:

Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601) 

A random sample of 10 rows:

> dt[sample(nrow(dt), 10), ]
     Weight Height   Fitted interval limit    value
1253   52.2  162.5 60.28203     conf   upr 61.51087
426    49.1  158.8 56.54022     pred   upr 74.75756
1117   78.4  184.5 82.53066     conf   lwr 80.98778
1171   85.9  166.4 64.22611     conf   lwr 63.21254
948    61.4  177.8 75.75494     conf   lwr 74.66393
384    90.9  172.7 70.59731     pred   lwr 52.41828
289    75.9  172.7 70.59731     pred   lwr 52.41828
3      44.8  149.5 47.13512     pred   lwr 28.83891
774    87.3  182.9 80.91258     pred   upr 99.12445
772    86.4  175.3 73.22669     pred   upr 91.40919
+0

Your example contains neither 'upr' in 'limit' nor 'conf' in 'interval', which means your expected result is not reproducible. – mtoto

+0

Why not keep the data in long format and just aggregate? See [this example](http://stackoverflow.com/a/32795497/2204410), which covers base R, *dplyr* and *data.table*. – Jaap

+0

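I have already done it with dcast, but I want to do it with tidyr in order to learn. @mtoto That was just the head of my dataset; I will edit the question to add a random sample so it is reproducible. – TheRimalaya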

Answers

9

Let's say you are starting with data that looks like this:

mydf
#   Weight Height  Fitted interval limit    value
# 1     42  153.4 51.0792     conf   lwr 49.15463
# 2     42  153.4 51.0792     pred   lwr 32.82122
# 3     42  153.4 51.0792     conf   upr 53.00376
# 4     42  153.4 51.0792     pred   upr 69.33717
# 5     42  153.4 51.0792     conf   lwr 60.00000
# 6     42  153.4 51.0792     pred   lwr 90.00000

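Notice that rows 5 and 6 repeat values already present in the grouping columns (columns 1 to 5): row 5 has the same identifiers as row 1, and row 6 the same as row 2. That is exactly what "tidyr" is telling you, because spread cannot decide which of the two values belongs in a single cell.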

tidyr::spread(mydf, limit, value) 
# Error: Duplicate identifiers for rows (1, 5), (2, 6) 
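
To see which identifier combinations are behind that error, one option is to count rows per combination of the grouping columns. This is only a minimal sketch, assuming dplyr is installed; the name `dupes` is made up for illustration, and `mydf` is rebuilt here with `data.frame()` so the snippet runs on its own.

```r
library(dplyr)

# Same six rows as the sample data used in this answer
mydf <- data.frame(
  Weight   = rep(42, 6),
  Height   = rep(153.4, 6),
  Fitted   = rep(51.0792, 6),
  interval = rep(c("conf", "pred"), 3),
  limit    = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Count rows per identifier combination; n > 1 marks a duplicate
dupes <- mydf %>%
  count(Weight, Height, Fitted, interval, limit) %>%
  filter(n > 1)

dupes
```

Here the two combinations with `limit == "lwr"` each appear twice, which corresponds to the row pairs (1, 5) and (2, 6) named in the error.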

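As @Jaap suggested, the solution is to first "summarise" the data. Since "tidyr" only reshapes data (unlike "reshape2", which aggregates and reshapes), you need to perform the aggregation with "dplyr" before changing the shape of the data. Here, I have done that with summarise on the "value" column.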

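If you stop at the summarise step, you will see that the original 6-row dataset has "shrunk" to 4 rows, one per unique combination of identifiers. Now spread works as expected.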

library(dplyr)
library(tidyr)

# Aggregate first so each identifier combination occurs once, then reshape
mydf %>%
    group_by(Weight, Height, Fitted, interval, limit) %>%
    summarise(value = mean(value)) %>%
    spread(limit, value)
# Source: local data frame [2 x 6]
#
#   Weight Height  Fitted interval      lwr      upr
#    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717
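
To check the "shrinking" for yourself, you can stop the pipeline after summarise and compare row counts. The following is a sketch under the same assumptions as the code in this answer (dplyr loaded); `agg` is a name chosen only for illustration, and `mydf` is rebuilt so the snippet is self-contained.

```r
library(dplyr)

# Same six rows as the sample data used in this answer
mydf <- data.frame(
  Weight   = rep(42, 6),
  Height   = rep(153.4, 6),
  Fitted   = rep(51.0792, 6),
  interval = rep(c("conf", "pred"), 3),
  limit    = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Stop after the aggregation step; ungroup() so the result is a plain table
agg <- mydf %>%
  group_by(Weight, Height, Fitted, interval, limit) %>%
  summarise(value = mean(value)) %>%
  ungroup()

nrow(mydf)  # 6 rows before aggregating
nrow(agg)   # 4 rows afterwards, one per identifier combination
```

Because every identifier combination now occurs exactly once in `agg`, spread has a single value for each cell.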

This matches the output you would expect from dcast with fun.aggregate = mean.

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean) 
#   Weight Height  Fitted interval      lwr      upr
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717
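
As a side note that is not part of the original answer: later tidyr releases provide pivot_wider, whose values_fn argument can aggregate duplicates while reshaping, which is closer to how dcast behaves. The sketch below assumes tidyr 1.1.0 or newer (where, as far as I know, values_fn accepts a single function); `wide` is an illustrative name.

```r
library(tidyr)

# Same six rows as the sample data used in this answer
mydf <- data.frame(
  Weight   = rep(42, 6),
  Height   = rep(153.4, 6),
  Fitted   = rep(51.0792, 6),
  interval = rep(c("conf", "pred"), 3),
  limit    = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Assumption: tidyr >= 1.1.0. values_fn is applied to every cell,
# so cells that receive several values are reduced to their mean.
wide <- pivot_wider(
  mydf,
  names_from  = limit,
  values_from = value,
  values_fn   = mean
)

wide
```

With spread itself there is no such argument, so the separate summarise step shown above is still needed.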

Sample data:

mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
    153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,   
    51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",   
    "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,    
    2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),    
     value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,   
     90)), .Names = c("Weight", "Height", "Fitted", "interval",  
    "limit", "value"), row.names = c(NA, 6L), class = "data.frame") 
+0

Thanks! I was wondering how to handle the aggregate function. I guess Hadley wants 'tidyr' and 'dplyr' to be used together. – TheRimalaya

+0

This is a great answer that helped me understand the difference between 'dcast' and 'spread'. Thanks! – Mikko

1

Here is a data.table alternative to dplyr, using mydf from Ananda's answer:

library(data.table) 
library(magrittr) 
library(tidyr) 

DT <- data.table(mydf) 

First, you can use by to compute the mean for each limit within every group.

DT[, .(lwr = mean(value[limit == "lwr"]), 
     upr = mean(value[limit == "upr"])), 
    by = .(Weight, Height, Fitted, interval)] 

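If limit == ... looks too hard-coded, you can instead aggregate in long format first and then spread. This works because once you have aggregated, there are no duplicates left.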

DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>% 
    spread(key = "limit", value = "value") 

Both approaches give you

#   Weight Height  Fitted interval      lwr      upr
#1:     42  153.4 51.0792     conf 54.57731 53.00376
#2:     42  153.4 51.0792     pred 61.41061 69.33717
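
A third route, not shown in the answer above, is data.table's own dcast method, which aggregates and reshapes in one call much like reshape2::dcast. This is a minimal sketch assuming a reasonably recent data.table; `wide` is an illustrative name, and DT is rebuilt here so the snippet runs on its own.

```r
library(data.table)

# Same six rows as mydf from Ananda's answer
DT <- data.table(
  Weight   = rep(42, 6),
  Height   = rep(153.4, 6),
  Fitted   = rep(51.0792, 6),
  interval = rep(c("conf", "pred"), 3),
  limit    = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# dcast on a data.table: duplicates within a cell are reduced with mean
wide <- dcast(
  DT,
  Weight + Height + Fitted + interval ~ limit,
  value.var     = "value",
  fun.aggregate = mean
)

wide
```

Naming value.var explicitly avoids relying on dcast guessing which column holds the values.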
+0

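Thanks, but I was actually asking about 'dplyr' and 'tidyr'. I had already solved this with 'reshape2', and I wanted to know how to do it with those specific packages. Thank you anyway! – TheRimalaya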