2011-07-09

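Extracting specific data from hierarchical data in R

I have a data frame with 6 columns. Columns 1 through 5 each hold discrete names or values: district, gender, year, month and age group. The sixth column is the number of deaths for that specific combination.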

   District Gender Year Month Age.Group Total.Deaths 
1    Eastern Female 2003  1  -1   0 
2    Eastern Female 2003  1  -2   2 
3    Eastern Female 2003  1   0   2 
4    Eastern Female 2003  1  01-4   1 
5    Eastern Female 2003  1  05-09   0 
6    Eastern Female 2003  1  10-14   1 
7    Eastern Female 2003  1  15-19   0 
8    Eastern Female 2003  1  20-24   4 
9    Eastern Female 2003  1  25-29   9 
10    Eastern Female 2003  1  30-34   3 
11    Eastern Female 2003  1  35-39   7 
12    Eastern Female 2003  1  40-44   5 
13    Eastern Female 2003  1  45-49   5 
14    Eastern Female 2003  1  50-54   8 
15    Eastern Female 2003  1  55-59   5 
16    Eastern Female 2003  1  60-64   4 
17    Eastern Female 2003  1  65-69   7 
18    Eastern Female 2003  1  70-74   8 
19    Eastern Female 2003  1  75-79   5 
20    Eastern Female 2003  1  80-84   10 
21    Eastern Female 2003  1  85+   11 
22    Eastern Female 2003  2  -1   0 
23    Eastern Female 2003  2  -2   0 
24    Eastern Female 2003  2   0   4 
25    Eastern Female 2003  2  01-4   1 
26    Eastern Female 2003  2  05-09   2 
27    Eastern Female 2003  2  10-14   2 
28    Eastern Female 2003  2  15-19   0 

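I would like to filter or extract smaller data frames from this large one. For example, I would like to have only four age groups, made up as follows: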

Group 0: Consisting of Age.Group -1, -2 and 0. 
Group 1-4: Consisting of Age.Group 01-4 
Group 5-14: Consisting of Age.Group 05-09 and 10-14 
Group 15+: Consisting of Age.Group 15-19 to 85+ 

Total.Deaths would then be the sum over each of these groups.

So I would like it to look like this:

   District Gender Year Month Age.Group Total.Deaths 
1    Eastern Female 2003  1   0   4 
2    Eastern Female 2003  1  01-4   1 
3    Eastern Female 2003  1  05-14   1 
4    Eastern Female 2003  1  15+   104 
5    Eastern Female 2003  2   0   4 
6    Eastern Female 2003  2  01-4   1 
7    Eastern Female 2003  2  05-14   4 
8    Eastern Female 2003  2  15+   ... 

I have a lot of data and have been searching for a few days, but I cannot find a function that does this.

Answer

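There is probably a way to recode your age variable with something like recode in the car package, particularly since your current age variable is conveniently coded with character levels that sort nicely. But when there are only a few levels, I tend to just recode them by hand into a new age variable, which is also good practice in R for simply "getting stuff done":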

# Read the data from a text file made by copying and pasting the question's table
dat <- read.table("~/Desktop/soEx.txt", sep = "", header = TRUE)

# Make Age.Group an ordered factor and initialise the new age variable.
# The comparisons below rely on the levels sorting as -1, -2, 0, 01-4, 05-09, ..., 85+;
# sort order is locale-dependent, so check levels(dat$Age.Group) first.
dat$Age.Group <- factor(dat$Age.Group, ordered = TRUE)
dat$AgeGroupNew <- rep(NA_character_, nrow(dat))

# The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14"] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14"] <- "15+"
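
If the ordering of the levels is a concern, the same recoding can also be written as an explicit lookup, which does not depend on how the codes sort. This is only a sketch on a handful of codes taken from the question; `age_map`, `age` and `age_new` are names made up for the illustration, and it assumes every code not listed in the map belongs to the 15+ group.

```r
# Hypothetical lookup table: names are the original Age.Group codes,
# values are the collapsed groups requested in the question.
age_map <- c("-1" = "0", "-2" = "0", "0" = "0",
             "01-4" = "01-4",
             "05-09" = "05-14", "10-14" = "05-14")

# A few of the codes from the question
age <- c("-1", "-2", "0", "01-4", "05-09", "10-14", "15-19", "85+")

# Look each code up by name. as.character() matters if the input is a
# factor, because indexing by a factor would use its integer codes.
age_new <- unname(age_map[as.character(age)])

# Codes absent from the map come back as NA; here they are assumed to be
# the 15-19 ... 85+ codes. (A genuinely missing age would also land here.)
age_new[is.na(age_new)] <- "15+"

print(age_new)
```

On the full data the same lookup would be applied to `dat$Age.Group` and stored in `dat$AgeGroupNew`.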

Then we can generate the summary using ddply and summarise:

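library(plyr)  # provides ddply(), summarise() and .()

# Sum Total.Deaths within each District/Gender/Year/Month/AgeGroupNew combination
datNew <- ddply(dat, .(District, Gender, Year, Month, AgeGroupNew), summarise,
                TotalDeaths = sum(Total.Deaths))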
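
For readers without plyr, a grouped sum of this kind can probably also be done with `aggregate` from base R. The sketch below is a swapped-in alternative, not the code used above. It runs on a toy data frame (`toy`, a hypothetical name) built from the month-2 rows shown in the question, already recoded. Only the 15-19 row of the 15+ group appears in that excerpt, so its total here is partial.

```r
# Toy data: the month-2 rows from the question, with the recoded age group
toy <- data.frame(
  District     = "Eastern",
  Gender       = "Female",
  Year         = 2003,
  Month        = 2,
  AgeGroupNew  = c("0", "0", "0", "01-4", "05-14", "05-14", "15+"),
  Total.Deaths = c(0, 0, 4, 1, 2, 2, 0),
  stringsAsFactors = FALSE
)

# Sum Total.Deaths within each combination of the grouping columns
toyNew <- aggregate(Total.Deaths ~ District + Gender + Year + Month + AgeGroupNew,
                    data = toy, FUN = sum)

print(toyNew)
```

Unlike the ddply call, this keeps the original column name `Total.Deaths` for the summed values.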

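At first I was worried because I got 91 deaths for the 15+ group rather than the 104 you indicated, but I counted by hand and I think 91 is correct. Perhaps the 104 is a typo.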
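
The hand count can be reproduced by summing the month-1 values for Age.Group 15-19 through 85+ (rows 7 to 21 of the question's table). The vector name `deaths_15_plus` is made up for this check.

```r
# Total.Deaths for Age.Group 15-19 through 85+ in month 1,
# copied from rows 7 to 21 of the table in the question
deaths_15_plus <- c(0, 4, 9, 3, 7, 5, 5, 8, 5, 4, 7, 8, 5, 10, 11)

print(sum(deaths_15_plus))  # 91
```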

Thanks a lot, joran. It works on my end, and I learned a few things about R. Thanks again. Sorry about the 104 typo! – OSlOlSO