Fastest way to reshape variable values

I have a dataset with about 3 million rows and the following structure:

PatientID| Year | PrimaryConditionGroup 
--------------------------------------- 
1  | Y1 | TRAUMA 
1  | Y1 | PREGNANCY 
2  | Y2 | SEIZURE 
3  | Y1 | TRAUMA

I am fairly new to R and am having some trouble finding the right way to reshape the data into the structure outlined below:

PatientID| Year | TRAUMA | PREGNANCY | SEIZURE 
---------------------------------------------- 
1  | Y1 | 1  | 1   | 0 
2  | Y2 | 0  | 0   | 1 
3  | Y1 | 1  | 0   | 0

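My question is: what is the fastest/most elegant way to create a data.frame in which the PrimaryConditionGroup values become columns, grouped by PatientID and Year (counting the number of occurrences)?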

2011-11-15 Matt

There may be more succinct ways of doing this, but for sheer speed it is hard to beat a data.table-based solution:

df <- read.table(text="PatientID Year PrimaryConditionGroup 
1   Y1 TRAUMA 
1   Y1 PREGNANCY 
2   Y2 SEIZURE 
3   Y1 TRAUMA", header=TRUE) 

library(data.table) 
dt <- data.table(df, key=c("PatientID", "Year")) 

dt[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"), 
      PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"), 
      SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")), 
    by = list(PatientID, Year)] 

#  PatientID Year TRAUMA PREGNANCY SEIZURE 
# [1,]   1 Y1  1   1  0 
# [2,]   2 Y2  0   0  1 
# [3,]   3 Y1  1   0  0
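
If the condition names are not known in advance, the hard-coded columns above could be generated from the data instead. The following is only a minimal sketch of that idea and not part of the original answer: the toy data is retyped from the question, and the names conds and counts are hypothetical, introduced here for illustration.

```r
library(data.table)

# Toy data retyped from the question
df <- data.frame(PatientID = c(1, 1, 2, 3),
                 Year = c("Y1", "Y1", "Y2", "Y1"),
                 PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA"))
dt <- as.data.table(df)

# Condition names are read from the data rather than typed out by hand
# (conds is a hypothetical helper name)
conds <- unique(as.character(dt$PrimaryConditionGroup))

# For each (PatientID, Year) group, build one count column per condition
counts <- dt[, lapply(setNames(conds, conds),
                      function(cnd) sum(PrimaryConditionGroup == cnd)),
             by = list(PatientID, Year)]
print(counts)
```

This keeps the grouped-sum approach of the answer; only the list of columns is built programmatically.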

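Edit: aggregate() provides a "base R" solution that may or may not be more idiomatic. (The sole complication is that aggregate returns the counts as a single matrix column inside the data.frame rather than as separate columns; the last line below fixes that up.)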

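# table() needs a factor so that every group reports all three conditions
df$PrimaryConditionGroup <- factor(df$PrimaryConditionGroup)
out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table) 
out <- cbind(out[1:2], data.frame(out[[3]]))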
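
To see what aggregate() actually hands back here, it can help to inspect the result on the toy data. This is a small sketch under the assumption that the condition column is stored as a factor, so that table() reports the same levels for every group; the name flat is introduced only for illustration.

```r
# Toy data retyped from the question; the condition column is a factor
df <- data.frame(PatientID = c(1, 1, 2, 3),
                 Year = c("Y1", "Y1", "Y2", "Y1"),
                 PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA"),
                 stringsAsFactors = TRUE)

out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data = df, FUN = table)

# The result has only three columns; the third one is a matrix of counts
print(ncol(out))
print(is.matrix(out$PrimaryConditionGroup))

# Spread the matrix column into ordinary columns (flat is a hypothetical name)
flat <- cbind(out[1:2], as.data.frame(out$PrimaryConditionGroup))
print(flat)
```

The count columns come out in factor-level order, which is alphabetical by default.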

Second edit: finally, a succinct solution using the reshape package gets you to the same place.

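library(reshape) 
# melt to long format: the condition ends up in the "value" column
mdf <- melt(df, id=c("PatientID", "Year")) 
# count the rows for each PatientID/Year/value combination
cast(PatientID + Year ~ value, data=mdf, fun.aggregate=length)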
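
For readers on the newer reshape2 package, roughly the same two steps might look like the sketch below. It assumes reshape2 is installed and swaps reshape's cast() for reshape2's dcast(), so it is an adaptation rather than the code from the answer; the name wide is introduced only for illustration.

```r
library(reshape2)  # assumption: reshape2 is installed

# Toy data retyped from the question
df <- data.frame(PatientID = c(1, 1, 2, 3),
                 Year = c("Y1", "Y1", "Y2", "Y1"),
                 PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA"))

# Long format: one row per record, with the condition in the "value" column
mdf <- melt(df, id.vars = c("PatientID", "Year"))

# Wide format: one column per distinct value, filled with row counts
wide <- dcast(mdf, PatientID + Year ~ value,
              fun.aggregate = length, value.var = "value")
print(wide)
```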

2011-11-15 20:06:03

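+1 It's not as if 'ddply' would be much less typing, and it would actually be a lot slower. – joran

Why would you even consider ddply for this problem? – hadley

Hi Josh, thanks, this works as expected and performs well. What would be the most succinct/idiomatic way to reshape the data (if performance is not an issue)? – Matt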

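data.table versions >= 1.9.0 provide fast, C-implemented melt and dcast methods specific to data.tables. Here is a comparison with the other excellent answer from @Josh on 3 million rows of data (excluding base::aggregate, as it takes quite a while).

For more information, see the corresponding NEWS entry.

I'll assume you have 1000 patients and 5 years in total. You can adjust the variables patients and year accordingly.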

require(data.table) ## >= 1.9.0 
require(reshape2) 

set.seed(1L) 
patients = 1000L 
year = 5L 
n = 3e6L 
condn = c("TRAUMA", "PREGNANCY", "SEIZURE") 

# dummy data 
DT <- data.table(PatientID = sample(patients, n, TRUE), 
       Year = sample(year, n, TRUE), 
       PrimaryConditionGroup = sample(condn, n, TRUE)) 

DT_dcast <- function(DT) { 
    dcast.data.table(DT, PatientID ~ Year, fun.aggregate=length) 
} 

reshape2_dcast <- function(DT) { 
    reshape2::dcast(DT, PatientID ~ Year, fun.aggregate=length) 
} 

DT_raw <- function(DT) { 
    DT[ , list(TRAUMA = sum(PrimaryConditionGroup=="TRAUMA"), 
      PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"), 
       SEIZURE = sum(PrimaryConditionGroup=="SEIZURE")), 
    by = list(PatientID, Year)] 
} 

# system.time(.) timed 3 times 
#          Method Time_rep1 Time_rep2 Time_rep3 
#        DT_dcast     0.393     0.399     0.396 
#  reshape2_dcast     3.784     3.457     3.605 
#          DT_raw     0.647     0.680     0.657

dcast.data.table is about 1.6x faster than plain aggregation using data.table and about 8.8x faster than reshape2::dcast.
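
Note that the two dcast functions above cast PatientID ~ Year, whereas the layout asked for in the question has one row per patient and year and one column per condition. The sketch below casts to that layout and checks it against the grouped sums. It rests on a few assumptions that are not in the answer: data.table >= 1.9.6 (where dcast() works on data.tables directly), a smaller n so it runs quickly, and the names wide and raw, which are introduced only for illustration.

```r
library(data.table)  # assumption: >= 1.9.6, where dcast() dispatches on data.tables

set.seed(1L)
n <- 1e5L  # far fewer rows than the 3e6 used above, so the sketch runs quickly
condn <- c("TRAUMA", "PREGNANCY", "SEIZURE")
DT <- data.table(PatientID = sample(1000L, n, TRUE),
                 Year = sample(5L, n, TRUE),
                 PrimaryConditionGroup = sample(condn, n, TRUE))

# One row per (PatientID, Year), one count column per condition
wide <- dcast(DT, PatientID + Year ~ PrimaryConditionGroup,
              fun.aggregate = length, value.var = "PrimaryConditionGroup")

# Grouped-sum version, as in DT_raw above
raw <- DT[, list(TRAUMA = sum(PrimaryConditionGroup == "TRAUMA"),
                 PREGNANCY = sum(PrimaryConditionGroup == "PREGNANCY"),
                 SEIZURE = sum(PrimaryConditionGroup == "SEIZURE")),
          by = list(PatientID, Year)]

# Align row order and column order before comparing the two results
setkey(raw, PatientID, Year)
setcolorder(wide, names(raw))
same <- isTRUE(all.equal(as.data.frame(wide), as.data.frame(raw),
                         check.attributes = FALSE))
print(same)
```

Wrapping either call in system.time() gives a like-for-like timing on your own machine; the numbers reported above were measured on the functions exactly as written there.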

2014-03-13 10:39:15 Arun
