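The following should get you started. You basically need to do two things: subset and aggregate. I'll demonstrate a base R solution and a data.table solution.
First, some sample data.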
set.seed(1) # So you can reproduce my results
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
                                times = 30),
                  Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"),
                             each = 30),
                  Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
                  Year = rep(c(2000:2009), each = 3, times = 3),
                  Value = runif(90, min = 300, max = 600))
Let's start with the mean of the "Pretax" values, aggregated by "Country" and "Year", but only for the years 2001 to 2005.
aggregate(Value ~ Country + Year,
          dat[dat$KeyItem == "Pretax" & dat$Year >= 2001 & dat$Year <= 2005, ],
          mean)
# Country Year Value
# 1 India 2001 399.7184
# 2 UnitedStates 2001 464.1638
# 3 India 2002 443.5636
# 4 UnitedStates 2002 560.8373
# 5 India 2003 562.5964
# 6 UnitedStates 2003 370.9591
# 7 India 2004 404.0050
# 8 UnitedStates 2004 520.4933
# 9 India 2005 567.6595
# 10 UnitedStates 2005 493.0583
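To make the two steps explicit, here is a minimal sketch that does the subsetting and the aggregating separately, once in each order. It rebuilds the same sample data so it runs on its own. The object names (`pretax`, `subset_first`, `all_means`, `aggregate_first`) are just illustrative, not part of any API.

```r
# Rebuild the sample data so this block is self-contained
set.seed(1)
dat <- data.frame(
  KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"), times = 30),
  Bank    = rep(c("WellsFargo", "BankOfAmerica", "ICICI"), each = 30),
  Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
  Year    = rep(2000:2009, each = 3, times = 3),
  Value   = runif(90, min = 300, max = 600)
)

# Order 1: subset first, then aggregate the smaller data frame
pretax <- subset(dat, KeyItem == "Pretax" & Year >= 2001 & Year <= 2005)
subset_first <- aggregate(Value ~ Country + Year, data = pretax, FUN = mean)

# Order 2: aggregate everything first, then subset the summary
all_means <- aggregate(Value ~ KeyItem + Country + Year, data = dat, FUN = mean)
aggregate_first <- subset(all_means,
                          KeyItem == "Pretax" & Year >= 2001 & Year <= 2005,
                          select = c(Country, Year, Value))
rownames(aggregate_first) <- NULL  # drop the row names left over from all_means

# Compare the means produced by the two orders
all.equal(subset_first$Value, aggregate_first$Value)
```

Subsetting first means `aggregate()` only has to work on the rows you care about, which is probably the better choice when the data is large.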
Here's the same aggregation in data.table.
library(data.table)
DT <- data.table(dat, key = "Country,Bank,Year")
subset(DT, KeyItem == "Pretax")[Year %between% c(2001, 2005),
                                mean(Value), by = list(Country, Year)]
# Country Year V1
# 1: India 2001 399.7184
# 2: India 2002 443.5636
# 3: India 2003 562.5964
# 4: India 2004 404.0050
# 5: India 2005 567.6595
# 6: UnitedStates 2001 464.1638
# 7: UnitedStates 2002 560.8373
# 8: UnitedStates 2003 370.9591
# 9: UnitedStates 2004 520.4933
# 10: UnitedStates 2005 493.0583
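As a variation, the filter can go straight into the `i` argument instead of a separate `subset()` call, and the summary column can be given a name instead of the default `V1`. This is only a sketch of the same computation. The column name `MeanValue` and the object name `res` are my own choices, and the data is rebuilt so the block runs on its own.

```r
library(data.table)

# Rebuild the sample data so this block is self-contained
set.seed(1)
dat <- data.frame(
  KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"), times = 30),
  Bank    = rep(c("WellsFargo", "BankOfAmerica", "ICICI"), each = 30),
  Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
  Year    = rep(2000:2009, each = 3, times = 3),
  Value   = runif(90, min = 300, max = 600)
)
DT <- as.data.table(dat)

# i filters the rows, j computes the summary, by defines the groups
res <- DT[KeyItem == "Pretax" & Year %between% c(2001, 2005),
          .(MeanValue = mean(Value)),
          by = .(Country, Year)]

# Sort by Country, then Year (setorder() reorders by reference)
setorder(res, Country, Year)
res
```

No key is set here, so the result is sorted explicitly with `setorder()` to get the same Country-then-Year ordering as the keyed version.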
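Welcome to SO. This question has been asked many times before. Try http://stackoverflow.com/questions/8225621/faster-way-to-create-variable-that-aggregates-a-column-by-id for example. – mnel
Welcome to SO. @mnel is correct: questions about aggregation have been asked here many times. For your question, you have to both aggregate and subset your data. You can aggregate first and subset later or, if your dataset is very large, subset first and then aggregate (which is what I've demonstrated in my answer, though I'd rather not take away the fun of learning by experimenting). Also, as required reading: most users here are usually quicker to respond to [reproducible examples](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – A5C1D2H2I1M1N2O1R2T1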