2016-04-24 69 views

Aggregating a dataset while "ignoring" categorical variables

I have a dataset structured like this:

Neighborhood, var1, var2, COUNTRY, DAY, categ 1, categ 2 
    1   700  724  AL  0  YES YES 
    1   500  200  FR  0  YES  NO 
    .... 
    1   701  659  IT  1  NO  YES 
    1   791  669  IT  1  NO  YES 
    .... 
    2   239  222  GE  0  YES  NO 

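and so on...

So the hierarchy is "Neighborhood > Day > Country": for every neighborhood, every day, and every country I have observed values of var1, var2, categ1 and categ2.

I am not interested in analyzing the country at the moment, so what I want to do is aggregate over the COUNTRY field (summing var1 and var2 across countries; the categorical variables categ1 and categ2 are not affected by country) and end up with a dataset that gives me var1, var2, categ1 and categ2 for each neighborhood and each day.

I am very new to R programming and basically don't know many packages (I would write a program in C++, but I am forcing myself to learn R)... So do you have any idea how to do this?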

Data

df1 <- structure(list(Neighborhood = c(1L, 1L, 1L, 1L, 2L), 
         var1 = c(700L, 500L, 701L, 791L, 239L), 
         var2 = c(724L, 200L, 659L, 669L, 222L), 
         COUNTRY = c("AL", "FR", "IT", "IT", "GE"), 
         DAY = c(0L, 0L, 1L, 1L, 0L), 
         `categ 1` = c("YES", "YES", "NO", "NO", "YES"), 
         `categ 2` = c("YES", "NO", "YES", "YES", "NO")), 
       .Names = c("Neighborhood", "var1", "var2", "COUNTRY", "DAY", "categ 1", "categ 2"), 
       class = "data.frame", row.names = c(NA, -5L)) 

Edit: @akrun

When I try your command, the result is:

aggregate(.~Neighborhood+DAY+COUNTRY, data = df1[!grepl("^categ", names(df1))], mean)

 Neighborhood, DAY, COUNTRY, var1, var2 

1   1  0  AL  700 724 
2   1  0  FR  500 200 
3   2  0  GE  239 222 
4   1  1  IT  746 664 

But (in this example) what I would like to have is:

  Neighborhood, DAY, var1, var2 

1   1   0  1200 924   // where var1 = 700 + 500, and so on
2   1   1  1492 1328 
3   2   0  239 222 

Do you want 'aggregate(.~Neighborhood+DAY+COUNTRY, data = df1[!grepl("^categ", names(df1))], mean)'? – akrun


No. In this example I should do: aggregate(.~Neighborhood+DAY, data = df1[!grepl("^COUNTRY", names(df1))], sum), right? – user5609462


No, that won't work. If you are interested in the categ columns, should they be included in the grouping columns? – akrun

Answers


If we are not interested in the 'categ' columns, we can filter them out with grepl and use aggregate:

aggregate(.~Neighborhood+DAY, data= df1[!grepl("^(categ|COUNTRY)", names(df1))], sum) 
# Neighborhood DAY var1 var2 
#1   1 0 1200 924 
#2   2 0 239 222 
#3   1 1 1492 1328 
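
The same result can also be reached by selecting columns by type instead of by name, since sum() is not defined for character columns such as COUNTRY and the categ columns. The following is only a sketch: it rebuilds the sample data from the question, and the helper name num_cols is introduced here for illustration.

```r
# Sample data from the question; check.names = FALSE keeps names like "categ 1"
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1 = c(700L, 500L, 701L, 791L, 239L),
  var2 = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
  DAY = c(0L, 0L, 1L, 1L, 0L),
  `categ 1` = c("YES", "YES", "NO", "NO", "YES"),
  `categ 2` = c("YES", "NO", "YES", "YES", "NO"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Keep only the numeric columns (hypothetical helper name: num_cols);
# this drops COUNTRY and both categ columns in one step
num_cols <- names(df1)[sapply(df1, is.numeric)]

# Sum every remaining column within each Neighborhood/DAY group
res <- aggregate(. ~ Neighborhood + DAY, data = df1[num_cols], FUN = sum)
res
```

This assumes every numeric column that is not a grouping key should be summed, which holds for the sample data but may not hold for a larger dataset.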

Or use dplyr:

library(dplyr) 
df1 %>% 
    group_by(Neighborhood, DAY) %>% 
    summarise(across(matches("^var"), sum)) # across() (dplyr >= 1.0.0) replaces the deprecated summarise_each()/funs()
# Neighborhood DAY var1 var2 
#   (int) (int) (int) (int) 
#1   1  0 1200 924 
#2   1  1 1492 1328 
#3   2  0 239 222 
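
If the categ columns should be carried into the result, as the question asks, it may be worth checking first whether they really are constant within each Neighborhood/DAY group. In the sample data they are not: for Neighborhood 1, DAY 0, `categ 2` is YES for AL and NO for FR. The sketch below uses base R only; taking the first value per group is an assumption made here for illustration, not something the question specifies, and the names by_cols, cat_cols, n_distinct, sums, firsts and result are hypothetical.

```r
# Sample data from the question; check.names = FALSE keeps names like "categ 1"
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1 = c(700L, 500L, 701L, 791L, 239L),
  var2 = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
  DAY = c(0L, 0L, 1L, 1L, 0L),
  `categ 1` = c("YES", "YES", "NO", "NO", "YES"),
  `categ 2` = c("YES", "NO", "YES", "YES", "NO"),
  check.names = FALSE, stringsAsFactors = FALSE
)

by_cols  <- df1[c("Neighborhood", "DAY")]
cat_cols <- df1[c("categ 1", "categ 2")]

# Number of distinct categ values per group; 1 means constant within the group
n_distinct <- aggregate(cat_cols, by = by_cols,
                        FUN = function(x) length(unique(x)))

# Sum the numeric variables within each group
sums <- aggregate(df1[c("var1", "var2")], by = by_cols, FUN = sum)

# ASSUMPTION: keep the first categ value seen in each group
firsts <- aggregate(cat_cols, by = by_cols, FUN = function(x) x[1])

# Both results share the same grouping keys, so merge() lines them up
result <- merge(sums, firsts, by = c("Neighborhood", "DAY"))
result
```

Any group where n_distinct is greater than 1 would need a deliberate rule (first value, most frequent value, or adding the categ columns to the grouping keys, as the comment above suggests).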

'aggregate(cbind(var1, var2) ~ COUNTRY, df1, mean)' – rawr