2015-07-22

Speed up a conditional R loop with C/C++ or vectorisation

I need to create a matrix that holds covariate values taken from the other variables in my data set. Here is a self-contained example of my current solution:

library(dplyr) 
library(survival) 
library(microbenchmark) 
data(heart, package = "survival") 
data <- heart 

# Number of unique subjects 
n.sub <- data %>% group_by(id) %>% n_groups() 
# Unique failure times 
fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
# Number of unique failure times 
n.fail.time <- length(fail.time) 

# Pre-fill matrix. Will be filled with covariate values. 
mat <- matrix(NA_real_, nrow = n.sub, ncol = n.fail.time) 
# Run loop 
for(i in 1:n.sub) { # Number of subjects
    data.subject <- data[data$id == i, ] # subsetting here provides nice speed-up
    for(j in 1:n.fail.time) { # Number of failure times.
        value <- subset(data.subject, (start < fail.time[j]) & (stop >= fail.time[j]), select = transplant, drop = TRUE)
        if(length(value) == 0) { # An early event or censor will return an empty value. Assign zero.
            mat[i, j] <- 0
        } else {
            mat[i, j] <- value # True value
        }
    }
}

For a data set with several thousand observations this is far too slow. I don't know R well enough to see how best to vectorise this, and I don't know enough C/C++ to use Rcpp. How can I speed up this example with these (or other) options?

It looks like the file src/aalen.c in the timereg package may contain a C solution to a problem similar to mine. See the code near if ((start[c]<time) && (stop[c]>=time)). Though that may just be my ignorance of C / programming showing.
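For comparison, the inner loop alone can be vectorised in base R by testing all failure times at once with outer(). This is only a sketch of the idea (it assumes, like the loop above, that ids are numbered 1..n.sub):

```r
library(survival)
data(heart, package = "survival")
data <- heart

fail.time <- sort(unique(data$stop[data$event == 1]))
n.sub <- length(unique(data$id))

mat <- matrix(0, nrow = n.sub, ncol = length(fail.time))
for (i in seq_len(n.sub)) {
    d <- data[data$id == i, ]
    # TRUE where the subject's interval covers a failure time: (start < t) & (stop >= t)
    at.risk <- outer(d$start, fail.time, "<") & outer(d$stop, fail.time, ">=")
    # intervals within a subject are disjoint, so at most one row matches per column;
    # multiplying by the row's transplant code and summing picks out that row (or 0)
    mat[i, ] <- colSums(as.integer(d$transplant) * at.risk)
}
```

As in the loop above, assigning the factor transplant into a numeric matrix stores its integer level codes (1 and 2 for the heart data), so this reproduces the same values.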


I don't really know R, but I can read a few of these lines and tell you some C equivalents. Is there a way to use Rcpp to convert libraries? –


Once it's in C, the vectorisation part isn't too hard. –


@JeremiahDicharry: No, there is no way to convert libraries automatically. Rcpp is not an R-to-C(++) compiler, but "merely" rather useful glue. There are some extras (such as vectorisation). –
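To illustrate that comment: with Rcpp you still write the loop yourself in C++ and call it from R. A minimal sketch (the function name fill_mat and its signature are mine, it again assumes ids numbered 1..n.sub, and it needs the Rcpp package plus a C++ toolchain installed):

```r
library(Rcpp)
library(survival)
data(heart, package = "survival")

cppFunction('
NumericMatrix fill_mat(IntegerVector id, NumericVector start, NumericVector stop,
                       NumericVector transplant, NumericVector fail, int nsub) {
    NumericMatrix mat(nsub, fail.size());
    // loop over data rows instead of subjects x times: a row contributes
    // only where (start < t) && (stop >= t), the same test as in aalen.c
    for (int c = 0; c < id.size(); c++)
        for (int j = 0; j < fail.size(); j++)
            if (start[c] < fail[j] && stop[c] >= fail[j])
                mat(id[c] - 1, j) = transplant[c];
    return mat;
}')

fail.time <- sort(unique(heart$stop[heart$event == 1]))
n.sub <- length(unique(heart$id))
mat3 <- fill_mat(as.integer(heart$id), heart$start, heart$stop,
                 as.integer(heart$transplant), fail.time, n.sub)
```

Unvisited cells stay at the 0 that NumericMatrix initialises, which matches the loop's "assign zero" branch.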

Answers


I faced a similar choice and turned to C++ to speed up a different problem, but I ended up switching to R packages that are already efficiently implemented in C++ and using those. Here, what you want is a package called data.table.

If you are new to R this may be hard to follow, but good documentation for the data.table package is available here. To understand what is going on below, you might step through the code on the simplified data set I tested with (see the bottom of the answer) and watch the objects as they change value. The key to the speed-up is using data.table's fast assignment methods and performing only vectorised operations.

My solution is below. Note that I'm not sure whether you need the 0,1,2 values, but I'm happy to change the code to 0,1 if that matches what you expected.

require(data.table) 

dataDT <- data.table(data[, c("id", "start", "stop", "transplant")]) 
# add a serial number for each id 
dataDT[, idObs := 1:length(start), by = id ] 
# needed because transplant is a factor in the heart dataset 
dataDT[, transplant := as.integer(transplant)] 

# create a "long" format data.table of subjects, observation number, and start/stop times 
matDT <- data.table(subject = rep(1:n.sub, each = n.fail.time * max(dataDT$idObs)), 
        idObs = rep(1:max(dataDT$idObs), max(dataDT$idObs), n.sub * max(dataDT$idObs)), 
        fail.time = rep(fail.time, each = max(dataDT$idObs))) 

# merge in start and stop times 
setkey(matDT, subject, idObs) 
setkey(dataDT, id, idObs) 
matDT <- dataDT[matDT] 

# eliminate missings (for which no 2nd observation took place) 
matDT <- matDT[!is.na(transplant)] 

# this replicates the "value" assignment in the loop 
matDT[, value := transplant * ((start < fail.time) & (stop >= fail.time))] 

# sum on the ids by fail time 
matDT2 <- matDT[, list(matVal = sum(value)), by = list(id, fail.time)] 

# convert to a matrix 
mat2 <- matrix(matDT2$matVal, ncol = ncol(mat), byrow = TRUE, dimnames = list(1:n.sub, fail.time)) 

This is many times faster than your code, according to microbenchmark(), where the first method is the code from the question:

 min   lq  mean  median   uq  max neval 
310.503535 339.364159 396.287178 354.292829 406.937216 762.28838 100 
    7.113083 7.420517 9.436973 7.788479 9.426443 32.50355 100 

To show the output, I tested this on the first six rows of your data object. That makes a good example, because the third and fourth patients (id = 3, 4) each have two observations, one before and one after transplant.

data <- heart[1:6, ] 

and then I added row and column labels to your mat object:

colnames(mat) <- fail.time 
rownames(mat) <- 1:n.sub 
mat 
## 6 16 39 50 
## 1 1 1 1 1 
## 2 1 0 0 0 
## 3 2 2 0 0 
## 4 1 1 2 0 

Here you can see that the new mat2 is identical:

mat2 
## 6 16 39 50 
## 1 1 1 1 1 
## 2 1 0 0 0 
## 3 2 2 0 0 
## 4 1 1 2 0 
all.equal(mat, mat2) 
## [1] TRUE 

This is great, thanks. I converted your solution to dplyr, since I'm more familiar with it, but your solution is still the fastest! – kzoo


What is the 'length(start)' for? Shouldn't this just be 'dataDT[, idObs := seq_len(.N), by = id]'? And shouldn't the conversion from a factor be 'as.integer(as.character(transplant))'? –


@DavidArenburg I questioned that part too. I interpreted it as a simple way to get the maximum number of times an 'id' can be repeated. Your code is probably better. Your factor conversion gives a more intuitive result, but for the case where the factor levels are strings, returning the integer factor levels works better. – kzoo
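The difference between the two conversions is easy to see on a small example. as.integer() on a factor returns the level codes, which for heart$transplant (levels "0" and "1") is why the matrices above contain 1s and 2s rather than 0s and 1s:

```r
f <- factor(c("0", "1", "1"))
as.integer(f)                 # level codes: 1 2 2
as.integer(as.character(f))   # underlying values: 0 1 1

# with non-numeric levels, only the codes are available:
g <- factor(c("b", "a"))
as.integer(g)                 # 2 1 (codes follow the sorted levels "a", "b")
```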


Here is a dplyr version of @KenBenoit's solution (see the dplyr.matrix function). Below is code that tests all three approaches.

library(dplyr) 
library(data.table) 
library(survival) 
library(microbenchmark) 
data(heart, package = "survival") 
data <- heart 

old.matrix <- function(data) { 
    # Number of unique subjects 
    n.subjects <- data %>% group_by(id) %>% n_groups() 
    # Unique failure times 
    fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
    # Number of unique failure times 
    n.fail.time <- length(fail.time) 

    # Pre-fill matrix. Will be filled with covariate values. 
    mat <- matrix(NA_real_, nrow = n.subjects, ncol = n.fail.time) 
    # Run loop 
    for(i in 1:n.subjects) { # Number of subjects
        data.subject <- data[data$id == i, ] # subsetting here provides nice speed-up
        for(j in 1:n.fail.time) { # Number of failure times.
            value <- subset(data.subject, (start < fail.time[j]) & (stop >= fail.time[j]), select = transplant, drop = TRUE)
            if(length(value) == 0) { # An early event or censor will return an empty value. Assign zero.
                mat[i, j] <- 0
            } else {
                mat[i, j] <- value # True value
            }
        }
    }
    mat 
} 

dplyr.matrix <- function(data) { 
    # Number of unique subjects 
    n.subjects <- data %>% group_by(id) %>% n_groups() 
    # Unique failure times 
    fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
    # Number of unique failure times 
    n.fail.time <- length(fail.time) 

    # add a serial number for each id 
    data <- data %>% group_by(id) %>% mutate(id.serial = 1:length(start)) 
    # needed because transplant is a factor in the heart dataset 
    data$transplant <- as.integer(data$transplant) 
    # create a "long" format data.frame of subjects, observation number, and start/stop times 
    data.long <- data.frame(
    id = rep(1:n.subjects, each = n.fail.time * max(data$id.serial)), 
    id.serial = rep(1:max(data$id.serial), max(data$id.serial), n.subjects * max(data$id.serial)), 
    fail.time = rep(fail.time, each = max(data$id.serial)) 
) 
    # merge in start and stop times 
    data.merge <- left_join(data.long, data[, c("start", "stop", "transplant", "id", "id.serial")], by = c("id", "id.serial")) 
    # eliminate missings (for which no 2nd observation took place) 
    data.merge <- na.omit(data.merge) 
    # this replicates the "value" assignment in the loop 
    data.merge <- data.merge %>% mutate(value = transplant * ((start < fail.time) & (stop >= fail.time))) 
    # sum on the ids by fail time 
    data.merge <- data.merge %>% group_by(id, fail.time) %>% summarise(value = sum(value)) 
    # convert to a matrix 
    data.matrix <- matrix(data.merge$value, ncol = n.fail.time, byrow = TRUE, dimnames = list(1:n.subjects, fail.time)) 
    data.matrix 
} 

data.table.matrix <- function(data) { 
    # Number of unique subjects 
    n.subjects <- data %>% group_by(id) %>% n_groups() 
    # Unique failure times 
    fail.time <- data %>% filter(event == 1) %>% distinct(stop) %>% arrange(stop) %>% .$stop 
    # Number of unique failure times 
    n.fail.time <- length(fail.time) 

    dataDT <- data.table(data[, c("id", "start", "stop", "transplant")]) 
    # add a serial number for each id 
    dataDT[, idObs := 1:length(start), by = id ] 
    # needed because transplant is a factor in the heart dataset 
    dataDT[, transplant := as.integer(transplant)] 
    # create a "long" format data.table of subjects, observation number, and start/stop times 
    matDT <- data.table(subject = rep(1:n.subjects, each = n.fail.time * max(dataDT$idObs)), 
         idObs = rep(1:max(dataDT$idObs), max(dataDT$idObs), n.subjects * max(dataDT$idObs)), 
         fail.time = rep(fail.time, each = max(dataDT$idObs))) 
    # merge in start and stop times 
    setkey(matDT, subject, idObs) 
    setkey(dataDT, id, idObs) 
    matDT <- dataDT[matDT] 
    # eliminate missings (for which no 2nd observation took place) 
    matDT <- matDT[!is.na(transplant)] 
    # this replicates the "value" assignment in the loop 
    matDT[, value := transplant * ((start < fail.time) & (stop >= fail.time))] 
    # sum on the ids by fail time 
    matDT2 <- matDT[, list(matVal = sum(value)), by = list(id, fail.time)] 
    # convert to a matrix 
    mat2 <- matrix(matDT2$matVal, ncol = n.fail.time, byrow = TRUE, dimnames = list(1:n.subjects, fail.time)) 
    mat2 
} 

all(dplyr.matrix(data) == old.matrix(data)) 
all(dplyr.matrix(data) == data.table.matrix(data)) 

microbenchmark(
    old.matrix(data), 
    dplyr.matrix(data), 
    data.table.matrix(data), 
    times = 50 
) 

Output from microbenchmark:

Unit: milliseconds 
        expr  min   lq  mean median  uq  max neval cld 
     old.matrix(data) 325.949687 328.102482 333.20923 329.39368 331.28305 373.44774 50 c 
     dplyr.matrix(data) 17.586146 18.317833 20.04662 18.95724 19.62431 60.15858 50 b 
data.table.matrix(data) 9.464045 9.892281 10.72819 10.29394 11.44812 12.67738 50 a 

The results above are for a data set with 100 observations. When I tested this on a data set with 1000 observations, data.table starts to pull away even more.

Unit: milliseconds 
        expr  min   lq  mean  median  uq  max neval cld 
     old.matrix(data) 13095.7836 13114.1858 13162.5019 13134.0735 13150.217 13318.2496  5 c 
     dplyr.matrix(data) 1067.1942 1075.5291 1149.0789 1166.8951 1197.998 1237.7787  5 b 
data.table.matrix(data) 104.5133 155.2074 159.6794 159.6364 166.764 212.2758  5 a 

data.table is now the clear winner.