2017-08-25
0

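Panel data: sum by group and create a new variable

I know there are already many questions about summing by group, but none of them solved my problem. Here it is:

df1 is my simplified dataset: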

library(data.table) 
df1 = data.table(Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010), 
        ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628), 
        category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"), 
        var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301)) 

df2 is the desired result (see var2):

df2 = data.table(Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010), 
        ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628), 
        category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"), 
        var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301), 
        var2= c(130,130,700,700,35,35,350,350,132,132,702,702)) 

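So I want to compute the sum of var1, grouped by ID and the first two digits of category.

That is, if the first two digits of category are 09 (or 10, and so on), the sum of var1 for that combination of ID and category prefix should be assigned to var2. Rows with the same ID and the same category prefix should all get the same sum.

I tried to achieve this with: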

> df1$var2 = rep(NA, rep(length(df1$ID))) 
df1$var2 = ifelse(substr(df1$category,1,2)=="09", by(df1[Year==2009,]$var1, df1[Year==2009,]$ID,sum), df1$var2) 
df1$Var2 = ifelse(substr(df1$category,1,2)=="10", by(df1[Year==2010,]$var1, df1[Year==2010,]$ID,sum), df1$var1) 

But here the sums are not assigned to the correct rows.

Can anyone help me?

+2

Please take some time to format your code. – lmo

+0

You can do that by highlighting your code and pressing Ctrl + K. – useR

Answers

1
df1 = data.frame(Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010), 
        ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628), 
        category= c("0910",NA,"0911","0913", "0914", "0910","0910",NA,"1014","1012",NA,"1013"), 
        var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301)) 

I added NA values to the OP's original data frame to reflect the full specification he expects.

df1$category_sub = substr(df1$category, 1, 2) 
df1_aggre = aggregate(var1 ~ ID + category_sub, data = df1, sum) 
names(df1_aggre)[3] = "var2" 

df2 = merge(df1, df1_aggre, all=TRUE) 
df2[order(df2$Year),] 

Result:

> df2[order(df2$Year),] 
     ID category_sub Year category var1 var2 
1  1621           09 2009     0910   60   60 
4  1621         <NA> 2009     <NA>   70   NA 
5  1628           09 2009     0911  400  700 
6  1628           09 2009     0913  300  700 
9  3101           09 2009     0914   15   35 
10 3101           09 2009     0910   20   35 
11 3105           09 2009     0910  200  200 
12 3105         <NA> 2009     <NA>  150   NA 
2  1621           10 2010     1014   61  132 
3  1621           10 2010     1012   71  132 
7  1628           10 2010     1013  301  301 
8  1628         <NA> 2010     <NA>  401   NA 

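I first extract the first two digits of category into category_sub and sum var1 grouped by ID and category_sub. Then I rename var1 to var2 in the aggregated result and merge df1 with df1_aggre on ID and category_sub, using the all=TRUE option, which specifies a full outer join. The resulting data frame is no longer in the original row order, so I sort df2 by Year to get the desired result.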
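For the question's original df1, which has no missing categories, the same grouping can also be sketched without the aggregate-and-merge step. This is only an illustration, not part of the answer above: it uses base R's ave(), which returns a vector aligned with the input rows, so the original row order is kept. The data is rebuilt as a plain data.frame here so the snippet runs on its own.

```r
# Rebuild the question's simplified data (no missing categories).
df1 <- data.frame(
  Year = rep(c(2009, 2010), times = c(8, 4)),
  ID = c(1621, 1621, 1628, 1628, 3101, 3101, 3105, 3105,
         1621, 1621, 1628, 1628),
  category = c("0910", "0910", "0911", "0913", "0914", "0910",
               "0910", "0911", "1014", "1012", "1011", "1013"),
  var1 = c(60, 70, 400, 300, 15, 20, 200, 150, 61, 71, 401, 301)
)

# ave() applies FUN within each combination of the grouping vectors
# (here ID and the first two characters of category) and returns one
# value per input row, so every row of a group gets the group sum.
df1$var2 <- ave(df1$var1, df1$ID, substr(df1$category, 1, 2), FUN = sum)

df1
```

On this data, var2 should match the var2 column of the desired df2 in the question.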

+0

How would you approach this if you need to group by the first two digits of category (substr(category,1,2)) instead of Year? – Enrico

+0

@Enrico Do you mean extracting the first two digits from 'category' and then grouping by them? – useR

+0

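That is for another reason that I did not include here: some IDs have a missing value in category, and those values should be excluded from the sum. I did not include those gaps in the simplified dataset df1. – Enrico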
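Since the question's df1 is a data.table, the exclusion of missing categories described in this comment could be sketched as a grouped assignment by reference. This is a hedged illustration, not something posted in the thread: it assumes the data.table package is installed, and the grouping name prefix is made up for the example. The NA positions are the ones used in the answer's data.

```r
library(data.table)  # assumption: data.table is installed

# The answer's data, with NA in category for some rows.
df1 <- data.table(
  Year = rep(c(2009, 2010), times = c(8, 4)),
  ID = c(1621, 1621, 1628, 1628, 3101, 3101, 3105, 3105,
         1621, 1621, 1628, 1628),
  category = c("0910", NA, "0911", "0913", "0914", "0910",
               "0910", NA, "1014", "1012", NA, "1013"),
  var1 = c(60, 70, 400, 300, 15, 20, 200, 150, 61, 71, 401, 301)
)

# Restrict to rows with a known category, then add var2 by reference as
# the sum of var1 within each (ID, category prefix) group.
# Rows filtered out by the i-expression are left with var2 = NA.
df1[!is.na(category),
    var2 := sum(var1),
    by = .(ID, prefix = substr(category, 1, 2))]

df1
```

Because := modifies df1 in place, no merge or re-sorting is needed, and rows with a missing category neither contribute to a sum nor receive one.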
