2015-07-22

Speed up a conditional R loop with C/C++ or vectorisation

I need to create a matrix that holds covariate values taken from the other variables in my data set. Here is a self-contained example of my current solution:

library(dplyr) 
library(survival) 
library(microbenchmark) 
data(heart, package = "survival") 
data <- heart 

# Number of unique subjects 
n.sub <- data %>% group_by(id) %>% n_groups() 
# Unique failure times 
fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
# Number of unique failure times 
n.fail.time <- length(fail.time) 

# Pre-fill matrix. Will be filled with covariate values. 
mat <- matrix(NA_real_, nrow = n.sub, ncol = n.fail.time) 
# Run loop 
for(i in 1:n.sub) { # Number of subjects
    data.subject <- data[data$id == i, ] # subsetting here provides nice speed-up
    for(j in 1:n.fail.time) { # Number of failure times.
        value <- subset(data.subject, (start < fail.time[j]) & (stop >= fail.time[j]), select = transplant, drop = TRUE)
        if(length(value) == 0) { # An early event or censor will return an empty value. Assign zero.
            mat[i, j] <- 0
        } else {
            mat[i, j] <- value # True value
        }
    }
}

For a data set with several thousand observations this is far too slow. I don't know R well enough to see how best to vectorise this, and I don't know enough C/C++ to use Rcpp. How can I speed up this example with these (or other) options?

It looks like the file src/aalen.c in the timereg package may contain a C solution to a problem similar to mine. See the code near if ((start[c]<time) && (stop[c]>=time)). Though that may just be my ignorance of C / programming showing.
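For comparison, the inner loop alone can be vectorised in base R by testing all failure times at once with outer(). This is only a sketch of the idea (it assumes, like the loop above, that ids are numbered 1..n.sub):

```r
library(survival)
data(heart, package = "survival")
data <- heart

fail.time <- sort(unique(data$stop[data$event == 1]))
n.sub <- length(unique(data$id))

mat <- matrix(0, nrow = n.sub, ncol = length(fail.time))
for (i in seq_len(n.sub)) {
    d <- data[data$id == i, ]
    # TRUE where the subject's interval covers a failure time: (start < t) & (stop >= t)
    at.risk <- outer(d$start, fail.time, "<") & outer(d$stop, fail.time, ">=")
    # intervals within a subject are disjoint, so at most one row matches per column;
    # multiplying by the row's transplant code and summing picks out that row (or 0)
    mat[i, ] <- colSums(as.integer(d$transplant) * at.risk)
}
```

As in the loop above, assigning the factor transplant into a numeric matrix stores its integer level codes (1 and 2 for the heart data), so this reproduces the same values.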


I don't really know R, but I can read a few of these lines and tell you some C equivalents. Is there a way to use Rcpp to convert libraries? –


Once it's in C, the vectorisation part isn't too hard. –


@JeremiahDicharry: No, there is no way to convert libraries automatically. Rcpp is not an R-to-C(++) compiler, but "merely" rather useful glue. There are some extras (such as vectorisation). –
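To illustrate that comment: with Rcpp you still write the loop yourself in C++ and call it from R. A minimal sketch (the function name fill_mat and its signature are mine, it again assumes ids numbered 1..n.sub, and it needs the Rcpp package plus a C++ toolchain installed):

```r
library(Rcpp)
library(survival)
data(heart, package = "survival")

cppFunction('
NumericMatrix fill_mat(IntegerVector id, NumericVector start, NumericVector stop,
                       NumericVector transplant, NumericVector fail, int nsub) {
    NumericMatrix mat(nsub, fail.size());
    // loop over data rows instead of subjects x times: a row contributes
    // only where (start < t) && (stop >= t), the same test as in aalen.c
    for (int c = 0; c < id.size(); c++)
        for (int j = 0; j < fail.size(); j++)
            if (start[c] < fail[j] && stop[c] >= fail[j])
                mat(id[c] - 1, j) = transplant[c];
    return mat;
}')

fail.time <- sort(unique(heart$stop[heart$event == 1]))
n.sub <- length(unique(heart$id))
mat3 <- fill_mat(as.integer(heart$id), heart$start, heart$stop,
                 as.integer(heart$transplant), fail.time, n.sub)
```

Unvisited cells stay at the 0 that NumericMatrix initialises, which matches the loop's "assign zero" branch.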

Answers


I faced a similar choice and turned to C++ to speed up a different problem, but I ended up switching to R packages that are already efficiently implemented in C++ and using those. Here, what you want is a package called data.table.

If you are new to R this may be hard to follow, but good documentation for the data.table package is available here. To understand what is going on below, you might step through the code on the simplified data set I tested with (see the bottom of the answer) and watch the objects as they change value. The key to the speed-up is using data.table's fast assignment methods and performing only vectorised operations.

My solution is below. Note that I'm not sure whether you need the 0,1,2 values, but I'm happy to change the code to 0,1 if that matches what you expected.

require(data.table) 

dataDT <- data.table(data[, c("id", "start", "stop", "transplant")]) 
# add a serial number for each id 
dataDT[, idObs := 1:length(start), by = id ] 
# needed because transplant is a factor in the heart dataset 
dataDT[, transplant := as.integer(transplant)] 

# create a "long" format data.table of subjects, observation number, and start/stop times 
matDT <- data.table(subject = rep(1:n.sub, each = n.fail.time * max(dataDT$idObs)), 
        idObs = rep(1:max(dataDT$idObs), max(dataDT$idObs), n.sub * max(dataDT$idObs)), 
        fail.time = rep(fail.time, each = max(dataDT$idObs))) 

# merge in start and stop times 
setkey(matDT, subject, idObs) 
setkey(dataDT, id, idObs) 
matDT <- dataDT[matDT] 

# eliminate missings (for which no 2nd observation took place) 
matDT <- matDT[!is.na(transplant)] 

# this replicates the "value" assignment in the loop 
matDT[, value := transplant * ((start < fail.time) & (stop >= fail.time))] 

# sum on the ids by fail time 
matDT2 <- matDT[, list(matVal = sum(value)), by = list(id, fail.time)] 

# convert to a matrix 
mat2 <- matrix(matDT2$matVal, ncol = ncol(mat), byrow = TRUE, dimnames = list(1:n.sub, fail.time)) 

This is many times faster than your code, according to microbenchmark(), where the first method is the code from the question:

 min   lq  mean  median   uq  max neval 
310.503535 339.364159 396.287178 354.292829 406.937216 762.28838 100 
    7.113083 7.420517 9.436973 7.788479 9.426443 32.50355 100 

To show the output, I tested this on the first six rows of your data object. That makes a good example, because the third and fourth patients (id = 3, 4) each have two observations, one before and one after transplant.

data <- heart[1:6, ] 

and then I added row and column labels to your mat object:

colnames(mat) <- fail.time 
rownames(mat) <- 1:n.sub 
mat 
## 6 16 39 50 
## 1 1 1 1 1 
## 2 1 0 0 0 
## 3 2 2 0 0 
## 4 1 1 2 0 

Here you can see that the new mat2 is identical:

mat2 
## 6 16 39 50 
## 1 1 1 1 1 
## 2 1 0 0 0 
## 3 2 2 0 0 
## 4 1 1 2 0 
all.equal(mat, mat2) 
## [1] TRUE 

This is great, thanks. I converted your solution to dplyr, since I'm more familiar with it, but your solution is still the fastest! – kzoo


What is the 'length(start)' for? Shouldn't this just be 'dataDT[, idObs := seq_len(.N), by = id]'? And shouldn't the conversion from a factor be 'as.integer(as.character(transplant))'? –


@DavidArenburg I questioned that part too. I interpreted it as a simple way to get the maximum number of times an 'id' can be repeated. Your code is probably better. Your factor conversion gives a more intuitive result, but for the case where the factor levels are strings, returning the integer factor levels works better. – kzoo
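The difference between the two conversions is easy to see on a small example. as.integer() on a factor returns the level codes, which for heart$transplant (levels "0" and "1") is why the matrices above contain 1s and 2s rather than 0s and 1s:

```r
f <- factor(c("0", "1", "1"))
as.integer(f)                 # level codes: 1 2 2
as.integer(as.character(f))   # underlying values: 0 1 1

# with non-numeric levels, only the codes are available:
g <- factor(c("b", "a"))
as.integer(g)                 # 2 1 (codes follow the sorted levels "a", "b")
```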


Here is a dplyr version of @KenBenoit's solution (see the dplyr.matrix function). Below is code that tests all three approaches.

library(dplyr) 
library(data.table) 
library(survival) 
library(microbenchmark) 
data(heart, package = "survival") 
data <- heart 

old.matrix <- function(data) { 
    # Number of unique subjects 
    n.subjects <- data %>% group_by(id) %>% n_groups() 
    # Unique failure times 
    fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
    # Number of unique failure times 
    n.fail.time <- length(fail.time) 

    # Pre-fill matrix. Will be filled with covariate values. 
    mat <- matrix(NA_real_, nrow = n.subjects, ncol = n.fail.time) 
    # Run loop 
    for(i in 1:n.subjects) { # Number of subjects
        data.subject <- data[data$id == i, ] # subsetting here provides nice speed-up
        for(j in 1:n.fail.time) { # Number of failure times.
            value <- subset(data.subject, (start < fail.time[j]) & (stop >= fail.time[j]), select = transplant, drop = TRUE)
            if(length(value) == 0) { # An early event or censor will return an empty value. Assign zero.
                mat[i, j] <- 0
            } else {
                mat[i, j] <- value # True value
            }
        }
    }
    mat 
} 

dplyr.matrix <- function(data) { 
    # Number of unique subjects 
    n.subjects <- data %>% group_by(id) %>% n_groups() 
    # Unique failure times 
    fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
    # Number of unique failure times 
    n.fail.time <- length(fail.time) 

    # add a serial number for each id 
    data <- data %>% group_by(id) %>% mutate(id.serial = 1:length(start)) 
    # needed because transplant is a factor in the heart dataset 
    data$transplant <- as.integer(data$transplant) 
    # create a "long" format data.frame of subjects, observation number, and start/stop times 
    data.long <- data.frame(
    id = rep(1:n.subjects, each = n.fail.time * max(data$id.serial)), 
    id.serial = rep(1:max(data$id.serial), max(data$id.serial), n.subjects * max(data$id.serial)), 
    fail.time = rep(fail.time, each = max(data$id.serial)) 
) 
    # merge in start and stop times 
    data.merge <- left_join(data.long, data[, c("start", "stop", "transplant", "id", "id.serial")], by = c("id", "id.serial")) 
    # eliminate missings (for which no 2nd observation took place) 
    data.merge <- na.omit(data.merge) 
    # this replicates the "value" assignment in the loop 
    data.merge <- data.merge %>% mutate(value = transplant * ((start < fail.time) & (stop >= fail.time))) 
    # sum on the ids by fail time 
    data.merge <- data.merge %>% group_by(id, fail.time) %>% summarise(value = sum(value)) 
    # convert to a matrix 
    data.matrix <- matrix(data.merge$value, ncol = n.fail.time, byrow = TRUE, dimnames = list(1:n.subjects, fail.time)) 
    data.matrix 
} 

data.table.matrix <- function(data) { 
    # Number of unique subjects 
    n.subjects <- data %>% group_by(id) %>% n_groups() 
    # Unique failure times 
    fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
    # Number of unique failure times 
    n.fail.time <- length(fail.time) 

    dataDT <- data.table(data[, c("id", "start", "stop", "transplant")]) 
    # add a serial number for each id 
    dataDT[, idObs := 1:length(start), by = id ] 
    # needed because transplant is a factor in the heart dataset 
    dataDT[, transplant := as.integer(transplant)] 
    # create a "long" format data.table of subjects, observation number, and start/stop times 
    matDT <- data.table(subject = rep(1:n.subjects, each = n.fail.time * max(dataDT$idObs)), 
         idObs = rep(1:max(dataDT$idObs), max(dataDT$idObs), n.subjects * max(dataDT$idObs)), 
         fail.time = rep(fail.time, each = max(dataDT$idObs))) 
    # merge in start and stop times 
    setkey(matDT, subject, idObs) 
    setkey(dataDT, id, idObs) 
    matDT <- dataDT[matDT] 
    # eliminate missings (for which no 2nd observation took place) 
    matDT <- matDT[!is.na(transplant)] 
    # this replicates the "value" assignment in the loop 
    matDT[, value := transplant * ((start < fail.time) & (stop >= fail.time))] 
    # sum on the ids by fail time 
    matDT2 <- matDT[, list(matVal = sum(value)), by = list(id, fail.time)] 
    # convert to a matrix 
    mat2 <- matrix(matDT2$matVal, ncol = n.fail.time, byrow = TRUE, dimnames = list(1:n.subjects, fail.time)) 
    mat2 
} 

all(dplyr.matrix(data) == old.matrix(data)) 
all(dplyr.matrix(data) == data.table.matrix(data)) 

microbenchmark(
    old.matrix(data), 
    dplyr.matrix(data), 
    data.table.matrix(data), 
    times = 50 
) 

Output from microbenchmark:

Unit: milliseconds 
        expr  min   lq  mean median  uq  max neval cld 
     old.matrix(data) 325.949687 328.102482 333.20923 329.39368 331.28305 373.44774 50 c 
     dplyr.matrix(data) 17.586146 18.317833 20.04662 18.95724 19.62431 60.15858 50 b 
data.table.matrix(data) 9.464045 9.892281 10.72819 10.29394 11.44812 12.67738 50 a 

The results above are for a data set with 100 observations. When I tested this on a data set with 1000 observations, data.table starts to pull away even more.

Unit: milliseconds 
        expr  min   lq  mean  median  uq  max neval cld 
     old.matrix(data) 13095.7836 13114.1858 13162.5019 13134.0735 13150.217 13318.2496  5 c 
     dplyr.matrix(data) 1067.1942 1075.5291 1149.0789 1166.8951 1197.998 1237.7787  5 b 
data.table.matrix(data) 104.5133 155.2074 159.6794 159.6364 166.764 212.2758  5 a 

data.table is now the clear winner.