Rcpp function to add elements of a vector

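I have a very long vector of parameters (approximately 4^10 elements) and a vector of indices. My aim is to add together all of the parameter values that are indexed in the indices vector. For instance, if I had paras = [1,2,3,4,5,5,5] and indices = [3,3,1,6], then I would want the cumulative sum of the third value (3) twice, the first value (1) and the sixth (5), to get 12. There is additionally the option of warping the parameter values according to their position.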

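I want to speed up the R implementation because I call it millions of times.

My current code always returns NA, and I can't see what is wrong with it.

Here is the Rcpp function: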

double dot_prod_c(NumericVector indices, NumericVector paras, 
        NumericVector warp = NA_REAL) { 
int len = indices.size(); 
LogicalVector indices_ok; 
for (int i = 0; i < len; i++){ 
    indices_ok.push_back(R_IsNA(indices[i])); 
} 
if(is_true(any(indices_ok))){ 
    return NA_REAL; 
} 
double counter = 0; 
if(NumericVector::is_na(warp[1])){ 
    for (int i = 0; i < len; i++){ 
     counter += paras[indices[i]]; 
    } 
} else { 
    for (int i = 0; i < len; i++){ 
     counter += paras[indices[i]] * warp[i]; 
    } 
} 
return counter; 
}

Here is the working R version:

dot_prod <- function(indices, paras, warp = NA){ 
    if(is.na(warp[1])){ 
     return(sum(sapply(indices, function(ind) paras[ind + 1]))) 
    } else { 
     return(sum(sapply(1:length(indices), function(i){ 
      ind <- indices[i] 
      paras[ind + 1] * warp[i] 
     }))) 
    } 
}

Below is some test code, together with a benchmark using the microbenchmark package:

# testing 
library(Rcpp) 
library(microbenchmark) 

parameters <- list() 
indices <- list() 
indices_trad <- list() 

set.seed(2) 
for (i in 4:12){ 
    size <- 4^i 
    window_size <- 100 
    parameters[[i-3]] <- runif(size) 
    indices[[i-3]] <- floor(runif(window_size)*size) 
    temp <- rep(0, size) 
    for (j in 1:window_size){ 
     temp[indices[[i-3]][j] + 1] <- temp[indices[[i-3]][j] + 1] + 1 
    } 
    indices_trad[[i-3]] <- temp 
} 

microbenchmark(
    x <- sapply(1:9, function(i) dot_prod(indices[[i]], parameters[[i]])), 
    x_c <- sapply(1:9, function(i) dot_prod_c(indices[[i]], parameters[[i]])), 
    x_base <- sapply(1:9, function(i) indices_trad[[i]] %*% parameters[[i]]) 
) 
all.equal(x, x_base) # is true, does work 
all.equal(x_c, x_base) # not true - C++ version returns only NAs

Asked 2015-12-24 by Tom

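First, 'indices' should be an 'IntegerVector'. Second, I don't understand the control block around warp. Is 'warp' a vector or a scalar? Why are you referencing its second element (C++ indices start at 0, not 1)? Don't you need to convert the default value to a vector? –

I'm also confused about why you use both 'R_IsNA' and the Rcpp sugar '::is_na'. Why not use one or the other? –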

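I'm having a little trouble working out your overall goal from the code, so I'm just going to go with this explanation

For instance, if I had paras = [1,2,3,4,5,5,5] and indices = [3,3,1,6] then I would want to find the cumulative sum of the third value (3) twice, the first value (1) and the sixth (5), to get 12. There is additionally the option of warping the parameter values according to their position.

since it is the clearest to me.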

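There are a few problems with your C++ code. To start, instead of doing this - NumericVector warp = NA_REAL - use the Rcpp::Nullable<> template (shown below). This will solve a few problems:

It is more readable. If you're not familiar with the Nullable class, it is pretty much exactly what it sounds like - an object that may or may not be null.
You won't have to make any awkward initializations, such as NumericVector warp = NA_REAL. Frankly, I was surprised that the compiler accepted this.
You won't have to worry about accidentally forgetting that C++ uses zero-based indexing, unlike R, as in this line: if(NumericVector::is_na(warp[1])){. That has undefined behavior written all over it.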
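As a rough analogy in plain standard C++ (not Rcpp), an optional argument can be modelled with a pointer that defaults to nullptr, which plays a role similar to the one Rcpp::Nullable<> plays for R's NULL. The function name weighted_pick_sum and its pointer-based signature are assumptions made for this sketch. It also hints at a plausible reason for the NA result: if the default is a length-one vector holding NA, only warp[0] is a valid element, and once the weighted branch is taken that missing value propagates through the whole sum.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums params[indices[i]], optionally scaled by (*warp)[i].
// Indices are zero-based; warp == nullptr means "no weights supplied".
double weighted_pick_sum(const std::vector<std::size_t>& indices,
                         const std::vector<double>& params,
                         const std::vector<double>* warp = nullptr) {
    // When weights are supplied, there must be exactly one per index.
    assert(warp == nullptr || warp->size() == indices.size());

    double result = 0.0;
    for (std::size_t i = 0; i < indices.size(); ++i) {
        const double weight = (warp != nullptr) ? (*warp)[i] : 1.0;
        result += params.at(indices[i]) * weight;  // at() throws on a bad index
    }
    return result;
}
```

Calling weighted_pick_sum(indices, params) skips the weights, while passing &warp applies them. A NaN anywhere in the weights makes the result NaN, which is how a single missing weight would surface as a missing return value.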

Here's a revised version, going off of the description of the problem quoted above:

#include <Rcpp.h>

// warp_ is optional: it holds either a NumericVector or R's NULL.
typedef Rcpp::Nullable<Rcpp::NumericVector> nullable_t;

// [[Rcpp::export]]
double DotProd(Rcpp::NumericVector indices, Rcpp::NumericVector params, nullable_t warp_ = R_NilValue) {
    R_xlen_t i = 0, n = indices.size();
    double result = 0.0;

    if (warp_.isNull()) {
        // No weights supplied: plain sum of the indexed parameters.
        for (; i < n; i++) {
            result += params[indices[i]];
        }
    } else {
        // Weights supplied: scale each indexed parameter by warp[i].
        Rcpp::NumericVector warp(warp_);
        for (; i < n; i++) {
            result += params[indices[i]] * warp[i];
        }
    }

    return result;
}

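You have some complicated code to generate sample data. I didn't take the time to go through it because it wasn't necessary, and neither was the benchmarking. You said yourself that the C++ version wasn't producing the correct results. Your first priority should be to get your code working on simple data. Then feed it some more complex data. Then benchmark. The revised version above works on simple data: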

args <- list(
    indices = c(3, 3, 1, 6), 
    params = c(1, 2, 3, 4, 5, 5, 5), 
    warp = c(.25, .75, 1.25, 1.75) 
) 

all.equal(
    DotProd(args[[1]], args[[2]]), 
    dot_prod(args[[1]], args[[2]])) 
#[1] TRUE 

all.equal(
    DotProd(args[[1]], args[[2]], args[[3]]), 
    dot_prod(args[[1]], args[[2]], args[[3]])) 
#[1] TRUE
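
One detail worth checking by hand: both dot_prod (via paras[ind + 1]) and DotProd (via params[indices[i]]) read the indices as zero-based offsets, whereas the quoted description counts positions from one. The plain C++ sketch below is not Rcpp; the helper sum_at and its base parameter are hypothetical, introduced only to show that the two readings give different totals for the sample data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds up params at the given positions.
// base = 0 treats positions as zero-based offsets (as the posted code does);
// base = 1 treats them as one-based positions (as the prose example does).
double sum_at(const std::vector<double>& params,
              const std::vector<std::size_t>& positions,
              std::size_t base) {
    double total = 0.0;
    for (std::size_t pos : positions) {
        assert(pos >= base);             // a one-based position can never be 0
        total += params.at(pos - base);  // at() throws if the position is out of range
    }
    return total;
}
```

With params = {1, 2, 3, 4, 5, 5, 5} and positions = {3, 3, 1, 6}, sum_at(params, positions, 1) picks 3, 3, 1 and 5 for a total of 12, while sum_at(params, positions, 0) picks 4, 4, 2 and 5 for a total of 15. The all.equal checks above still pass because both implementations follow the zero-based reading.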

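The Rcpp version is also faster than the R version on this sample data. I have no reason to believe it wouldn't be for larger, more complex data as well - there is nothing magical or particularly efficient about the *apply functions; they are just more idiomatic / readable R.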

microbenchmark::microbenchmark(
    "Rcpp" = DotProd(args[[1]], args[[2]]), 
    "R" = dot_prod(args[[1]], args[[2]])) 
#Unit: microseconds 
#expr min  lq  mean median  uq max neval 
#Rcpp 2.463 2.8815 3.52907 3.3265 3.8445 18.823 100 
#R 18.869 20.0285 21.60490 20.4400 21.0745 66.531 100 
# 
microbenchmark::microbenchmark(
    "Rcpp" = DotProd(args[[1]], args[[2]], args[[3]]), 
    "R" = dot_prod(args[[1]], args[[2]], args[[3]])) 
#Unit: microseconds 
#expr min  lq  mean median  uq max neval 
#Rcpp 2.680 3.0430 3.84796 3.701 4.1360 12.304 100 
#R 21.587 22.6855 23.79194 23.342 23.8565 68.473 100

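I omitted the NA check from the examples above, but that too can be rewritten into something a little more idiomatic by using a bit of Rcpp sugar. Previously, you were doing this: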

LogicalVector indices_ok; 
for (int i = 0; i < len; i++){ 
    indices_ok.push_back(R_IsNA(indices[i])); 
} 
if(is_true(any(indices_ok))){ 
    return NA_REAL; 
}

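This is a bit heavy-handed - you are testing an entire vector of values (with R_IsNA) and then applying is_true(any(indices_ok)) - when you could just break early and return NA_REAL on the first instance of R_IsNA(indices[i]) evaluating to true. Also, using push_back will slow your function down quite a bit - you would be better off initializing indices_ok to the known size and filling it by index access inside the loop. Nevertheless, here is one way to condense the operation: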

if (Rcpp::na_omit(indices).size() != indices.size()) return NA_REAL;
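
The early-exit alternative described above can be sketched in plain standard C++ as follows. This is not Rcpp code: the name has_missing is hypothetical, and std::isnan stands in for the NA test on the assumption that a missing value is stored as a NaN (R's NA_real_ is a NaN with a particular payload, so std::isnan flags both NA and NaN).

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: returns true as soon as one missing (NaN) value is seen,
// without building an intermediate vector of flags.
bool has_missing(const std::vector<double>& values) {
    for (double v : values) {
        if (std::isnan(v)) {
            return true;  // stop at the first missing value
        }
    }
    return false;  // scanned everything, nothing missing
}
```

In the worst case (no missing values) this still visits every element once, but it allocates nothing and stops as soon as the answer is known.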

For the sake of completeness, here is a fully sugar-ized version that lets you avoid loops entirely:

#include <Rcpp.h> 

typedef Rcpp::Nullable<Rcpp::NumericVector> nullable_t; 
// [[Rcpp::export]] 
double DotProd3(Rcpp::NumericVector indices, Rcpp::NumericVector params, nullable_t warp_ = R_NilValue) { 
    if (Rcpp::na_omit(indices).size() != indices.size()) return NA_REAL; 

    if (warp_.isNull()) {
        Rcpp::NumericVector tmp = params[indices];
        return Rcpp::sum(tmp);
    } else {
        Rcpp::NumericVector warp(warp_), tmp = params[indices];
        return Rcpp::sum(tmp * warp);
    }
} 

/*** R 

all.equal(
    DotProd3(args[[1]], args[[2]]), 
    dot_prod(args[[1]], args[[2]])) 
#[1] TRUE 

all.equal(
    DotProd3(args[[1]], args[[2]], args[[3]]), 
    dot_prod(args[[1]], args[[2]], args[[3]])) 
#[1] TRUE 

*/
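
For comparison, the loop-free structure of DotProd3 (subset params by indices, then sum the element-wise product with warp) corresponds to a gather followed by an inner product. Below is a minimal sketch in plain standard C++, not Rcpp; the names gather and gathered_dot are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: copies params[indices[i]] into a new vector,
// mirroring the subsetting step `params[indices]` (zero-based).
std::vector<double> gather(const std::vector<double>& params,
                           const std::vector<std::size_t>& indices) {
    std::vector<double> picked;
    picked.reserve(indices.size());  // size is known up front, so allocate once
    for (std::size_t idx : indices) {
        picked.push_back(params.at(idx));  // at() throws on a bad index
    }
    return picked;
}

// Hypothetical helper: sum of picked[i] * warp[i], mirroring sum(tmp * warp).
double gathered_dot(const std::vector<double>& params,
                    const std::vector<std::size_t>& indices,
                    const std::vector<double>& warp) {
    assert(warp.size() == indices.size());  // one weight per index
    const std::vector<double> picked = gather(params, indices);
    return std::inner_product(picked.begin(), picked.end(), warp.begin(), 0.0);
}
```

The temporary vector costs one extra allocation compared with the explicit loop; whether that matters would need to be measured on realistic data.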

Answered 2015-12-24 22:48:43 by nrussell

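Nice work by @nrussell, as usual. Thanks for taking the time to work out what the original question meant. –

Thanks Dirk - must be the holiday spirit ;) – nrussell

Thanks for your help @nrussell, you hit the nail on the head. In particular, thank you for taking the time to explain the problems with the code so clearly. Apologies for not making the question clearer; I am still getting familiar with the site and will do better next time. – Tom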
