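The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative: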
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category])
# user system elapsed
# 0.008 0.001 0.009
Let's compare that with the same operation using a data.frame and aggregate:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
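If you want to keep the column name, this is the syntax:
data = as.data.table(data)  # data was reassigned to a data.frame above, so convert it back
data[, list(Frequency = sum(Frequency)), by = Category]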
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference becomes more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time(data[,sum(Frequency),by=Category])
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.287 0.010 0.296
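For multiple aggregations, you can combine lapply and .SD as follows (here data is again the small data.table from the first example):
data[, lapply(.SD, sum), by = Category]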
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
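With a single numeric column the lapply/.SD form gives the same result as the plain sum, so here is a minimal sketch with a second column to show what it adds. The Amount column and the names dt, sums, freq_only and stats are invented for illustration and are not part of the original question.

```r
library(data.table)

# Hypothetical data: Amount is an assumed extra column, added only for illustration
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Amount    = c(1, 2, 3, 4, 5, 6, 7)
)

# Sum every non-grouping column per Category (.SD holds those columns for each group)
sums <- dt[, lapply(.SD, sum), by = Category]

# Limit .SD to selected columns with .SDcols
freq_only <- dt[, lapply(.SD, sum), by = Category, .SDcols = "Frequency"]

# Several named summaries of one column; .() is data.table's alias for list()
stats <- dt[, .(Total = sum(Frequency), Mean = mean(Frequency), N = .N), by = Category]

print(sums)
print(stats)
```

Using .SDcols is probably the safer habit on wide tables, since lapply(.SD, sum) would otherwise be applied to every remaining column, including non-numeric ones.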
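@AndrewMcKinlay, R uses the tilde to define symbolic formulas, for statistics and other functions. It can be read as *"Frequency by Category"* or *"Frequency depending on Category"*. Not all languages use a special operator to define a symbolic function the way R does here. Perhaps with that "natural-language reading" of the tilde operator, it becomes more meaningful (even intuitive). I personally find this symbolic formula notation better than some of the more verbose alternatives. – r2evans 2016-12-19 04:35:12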
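To make the tilde reading in the comment above concrete, here is a minimal sketch of the formula interface of base R's aggregate, using the same example data as the answer. The names df and result are chosen only for this illustration.

```r
# Same example data as in the answer, as a plain data.frame
df <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Read the formula as "Frequency by Category": the left side is summarised,
# the right side defines the groups
result <- aggregate(Frequency ~ Category, data = df, FUN = sum)
print(result)
```

This should give the same per-category totals as the non-formula call aggregate(df$Frequency, by = list(Category = df$Category), FUN = sum), while keeping the original column names in the result.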