lapply vs for loop - Performance R

人們經常說，人們應該更喜歡lapply而不是for循環。 Hadley Wickham在他的Advance R書中指出了一些例外。（）（修改原地，遞歸等）。以下是這種情況之一。lapply vs for loop - Performance R

爲了學習的緣故，我嘗試用函數形式重寫感知器算法，以便基準測試的相對性能。來源（https://rpubs.com/FaiHas/197581）。

這是代碼。

# prepare input 
data(iris) 
irissubdf <- iris[1:100, c(1, 3, 5)] 
names(irissubdf) <- c("sepal", "petal", "species") 
head(irissubdf) 
irissubdf$y <- 1 
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1 
x <- irissubdf[, c(1, 2)] 
y <- irissubdf[, 4] 

# perceptron function with for 
perceptron <- function(x, y, eta, niter) { 

    # initialize weight vector 
    weight <- rep(0, dim(x)[2] + 1) 
    errors <- rep(0, niter) 


    # loop over number of epochs niter 
    for (jj in 1:niter) { 

    # loop through training data set 
    for (ii in 1:length(y)) { 

     # Predict binary label using Heaviside activation 
     # function 
     z <- sum(weight[2:length(weight)] * as.numeric(x[ii, 
     ])) + weight[1] 
     if (z < 0) { 
     ypred <- -1 
     } else { 
     ypred <- 1 
     } 

     # Change weight - the formula doesn't do anything 
     # if the predicted value is correct 
     weightdiff <- eta * (y[ii] - ypred) * c(1, 
     as.numeric(x[ii, ])) 
     weight <- weight + weightdiff 

     # Update error function 
     if ((y[ii] - ypred) != 0) { 
     errors[jj] <- errors[jj] + 1 
     } 

    } 
    } 

    # weight to decide between the two species 

    return(errors) 
} 

err <- perceptron(x, y, 1, 10) 

### my rewriting in functional form auxiliary 
### function 
faux <- function(x, weight, y, eta) { 
    err <- 0 
    z <- sum(weight[2:length(weight)] * as.numeric(x)) + 
    weight[1] 
    if (z < 0) { 
    ypred <- -1 
    } else { 
    ypred <- 1 
    } 

    # Change weight - the formula doesn't do anything 
    # if the predicted value is correct 
    weightdiff <- eta * (y - ypred) * c(1, as.numeric(x)) 
    weight <<- weight + weightdiff 

    # Update error function 
    if ((y - ypred) != 0) { 
    err <- 1 
    } 
    err 
} 

weight <- rep(0, 3) 
weightdiff <- rep(0, 3) 

f <- function() { 
    t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y), 
    function(i) { 
     faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 
     1) 
    })))) 
    weight <<- rep(0, 3) 
    t 
}

我沒有任何預期持續改善由於上述問題。但是，當我看到使用lapply和replicate的劇烈惡化時，我真的很驚訝。

我獲得使用microbenchmark功能從microbenchmark庫

這個結果怎麼可能是什麼原因？難道是一些內存泄漏？

             expr  min   lq  mean  median   uq 
                 f() 48670.878 50600.7200 52767.6871 51746.2530 53541.2440 
    perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10) 4184.131 4437.2990 4686.7506 4532.6655 4751.4795 
perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10) 95.793 104.2045 123.7735 116.6065 140.5545 
     max neval 
109715.673 100 
    6513.684 100 
    264.858 100

第一個功能是所述lapply/replicate功能

第二與for環

第三是在C++相同功能的使用Rcpp

這裏根據羅蘭功能功能分析。我不確定我能否以正確的方式解讀它。它看起來像我的大部分時間都花在子集化 Function profiling

來源

2017-02-22 Federico Manigrasso

請確定。我沒有看到在函數f中調用'apply'。 – Roland

我建議你學習如何剖析函數：http://adv-r.had.co.nz/Profiling.html – Roland

你的代碼有幾個錯誤;首先，'irissubdf [，4] < - 1'應該是'irissubdf $ y < - 1'，所以您可以稍後使用該名稱，其次，'weight'在您在'f'中使用之前未定義。我也不清楚'' - ''在'lapply'和'replicate'命令中做了正確的事情，但是我不清楚它應該做什麼。這也可能是兩者之間的主要區別; '<< - '必須處理環境，而另一個則不處理，雖然我不確切知道可能會有什麼影響，但它不再是蘋果來比較蘋果。 – Aaron

首先，它是一個已經很長揭穿神話，for循環是任何慢於lapply。 R中的for環路的性能更高，目前至少與lapply一樣快。

這就是說，你必須重新考慮在這裏使用lapply。您的實施需要分配給全球環境，因爲您的代碼要求您在循環過程中更新權重。這是不考慮lapply的合理原因。

lapply是您應該用於其副作用（或缺乏副作用）的功能。功能lapply將結果自動合併到列表中，並且不會與您工作的環境混淆，這與for循環相反。 replicate也是如此。也看到這個問題：

Is R's apply family more than syntactic sugar?

的原因，您的lapply的解決方案是慢得多，是因爲你使用它的方式創造了很多更多的開銷。

replicate是什麼都沒有，但sapply內部，所以你實際上結合sapply和lapply來實現您的雙迴路。 sapply會產生額外的開銷，因爲它必須測試結果是否可以簡化。因此for循環實際上比使用replicate快。
在您的lapply匿名函數中，您必須訪問每個觀測的x和y的數據幀。這意味着 - 與您的for-loop相反 - 例如功能$必須每次調用。
因爲你使用這些高端的功能，你的「lapply」的解決方案要求49層的功能，比起你for的解決方案，只要求26的lapply解決這些額外的功能包括names像match功能，structure，[[，電話，%in%，sys.call,duplicated，... 您的for循環不需要的所有功能，因爲該功能不會執行任何這些檢查。

如果你想看到這個額外的開銷從何而來，看看replicate，unlist，sapply和simplify2array內部代碼。

您可以使用以下代碼更好地瞭解您在lapply處失去效果的位置。一行一行地運行！

Rprof(interval = 0.0001) 
f() 
Rprof(NULL) 
fprof <- summaryRprof()$by.self 

Rprof(interval = 0.0001) 
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10) 
Rprof(NULL) 
perprof <- summaryRprof()$by.self 

fprof$Fun <- rownames(fprof) 
perprof$Fun <- rownames(perprof) 

Selftime <- merge(fprof, perprof, 
        all = TRUE, 
        by = 'Fun', 
        suffixes = c(".lapply",".for")) 

sum(!is.na(Selftime$self.time.lapply)) 
sum(!is.na(Selftime$self.time.for)) 
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE), 
     c("Fun","self.time.lapply","self.time.for")] 

Selftime[is.na(Selftime$self.time.for),]

來源

2017-02-24 14:14:49

lapply vs for loop - Performance R

回答

相關問題