2017-02-13
0

How to get rid of this loop

I was told that there is no need for a "for" loop in R. So I want to see how I can get rid of this Pythonic "for" loop in my R code:

numrows = nrow(yrdf) # number of rows; used for the last-row check below
diff.vec = c() # vector of differences
for (index in 1:nrow(yrdf)) { # yrdf is a data frame
    if (index == numrows) {
        diff = NA # because there is no entry "below" it
    } else {
        val_index = yrdf$Adj.Close[index]
        val_next = yrdf$Adj.Close[index+1]
        diff = val_index - val_next # diff between two adjacent values
        diff = diff/yrdf$Adj.Close[index+1] * 100.0
    }
    diff.vec<-c(diff.vec,diff) # append to vector of differences
}
+2

There is a 'diff' function to get the differences of adjacent elements in 'R'. Also, check the 'lead' and 'lag' functions in 'dplyr' – akrun

+7

Whoever told you that is wrong. Some operations will require a loop. –

+0

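Sometimes 'for loops' are the preferred approach. See [this](http://stackoverflow.com/a/6466415/4408538) post for a detailed explanation of when to use a "for loop". Check out these posts to better understand R's looping constructs: [post1](http://stackoverflow.com/a/2276001/4408538) and [post2](http://stackoverflow.com/q/28983292/4408538). –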

Answers

1

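In my experience, there are three reasons to avoid a for loop. The first is that loops can be hard for others to read (if you share your code), and the apply family of functions can improve on that (and is more explicit about what is returned). The second is the speed advantage that is possible in some cases, particularly if you ever want to run the code in parallel (e.g., most apply functions are embarrassingly parallel, while for loops take more work to break apart).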
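To illustrate the first point, here is a minimal sketch of how the question's loop might look with vapply, one member of the apply family. The sample data and the name pct_change are assumptions made for illustration only.

```r
# Sample data (assumed for illustration)
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(5))

x <- yrdf$Adj.Close
n <- length(x)

# vapply declares the return type up front: exactly one numeric per index
pct_change <- vapply(seq_len(n), function(i) {
  if (i == n) return(NA_real_)          # no entry "below" the last row
  (x[i] - x[i + 1]) / x[i + 1] * 100    # same formula as the question's loop
}, numeric(1))

pct_change
```

This still calls the anonymous function once per row, so it mainly addresses readability, not speed.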

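However, it is the third reason that serves you here: a vectorized solution is often better than any of the above, because it avoids repeated calls (e.g., the c at the end of your loop, the if check, and so on). Here, you can accomplish everything with a single vectorized call.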

First, some sample data

set.seed(8675309) 
yrdf <- data.frame(Adj.Close = rnorm(5)) 

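Then, we multiply everything by 100, take the diff of adjacent entries of Adj.Close, and use vectorized division to divide by the following entry. Note that I need to pad with NA if (and only if) you need the result to be the same length as the input. If you don't want or need that NA at the end of the vector, it can be even simpler. Also note that diff returns the next entry minus the current one, so the sign is the opposite of the loop in the question, which computes current minus next; use -100 in place of 100 if you need the loop's sign convention.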

100 * c(diff(yrdf$Adj.Close),NA)/c(yrdf$Adj.Close[2:nrow(yrdf)], NA) 

returns

[1] 238.06442 216.94975 130.41349 -90.47879  NA 
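If the trailing NA is not needed, the expression gets shorter. The following is a minimal sketch under that assumption; it also compares the result, negated to match the question's current-minus-next convention, against an element-by-element loop. The names x, pct_next_minus_current, pct_current_minus_next, and loop_reference are hypothetical and chosen only for this illustration.

```r
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(5))
x <- yrdf$Adj.Close

# No padding: one value per adjacent pair, so the length is nrow(yrdf) - 1.
# x[-1] drops the first element, leaving the "next" value of each pair.
pct_next_minus_current <- 100 * diff(x) / x[-1]

# The question's loop computes (current - next) / next, i.e. the negation.
pct_current_minus_next <- -pct_next_minus_current

# Element-by-element reference (hypothetical helper, for comparison only)
loop_reference <- function(x) {
  out <- numeric(length(x) - 1)
  for (i in seq_len(length(x) - 1)) {
    out[i] <- (x[i] - x[i + 1]) / x[i + 1] * 100
  }
  out
}

# Compare the vectorized result with the loop, within floating-point tolerance
all.equal(pct_current_minus_next, loop_reference(x))
```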

And, to be explicit, here is the microbenchmark comparison:

myForLoop <- function(){
    numrows = nrow(yrdf)
    diff.vec = c() # vector of differences
    for (index in 1:nrow(yrdf)) { # yrdf is a data frame
        if (index == numrows) {
            diff = NA # because there is no entry "below" it
        } else {
            val_index = yrdf$Adj.Close[index]
            val_next = yrdf$Adj.Close[index+1]
            diff = val_index - val_next # diff between two adjacent values
            diff = diff/yrdf$Adj.Close[index+1] * 100.0
        }
        diff.vec<-c(diff.vec,diff) # append to vector of differences
    }
    return(diff.vec)
}

microbenchmark::microbenchmark(
    forLoop = myForLoop() 
    , vector = 100 * c(diff(yrdf$Adj.Close),NA)/c(yrdf$Adj.Close[2:nrow(yrdf)], NA) 
) 

gives:

Unit: microseconds
    expr    min     lq     mean median      uq     max neval
 forLoop 74.238 78.184 82.06786 81.287 84.3740 104.190   100
  vector 20.193 21.718 23.91824 22.716 24.0535  80.754   100

Note that the vector approach takes about 30% of the time of the for loop. This becomes more important as the size of the data increases:

set.seed(8675309) 
yrdf <- data.frame(Adj.Close = rnorm(10000)) 

microbenchmark::microbenchmark(
    forLoop = myForLoop() 
    , vector = 100 * c(diff(yrdf$Adj.Close),NA)/c(yrdf$Adj.Close[2:nrow(yrdf)], NA) 
) 

Unit: microseconds
    expr        min         lq        mean     median          uq        max neval
 forLoop 306883.977 315116.446 351183.7997 325211.743 361479.6835 545383.457   100
  vector    176.704    194.948    326.6135    219.512    236.9685   4989.051   100

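Note how huge the difference is at this size: the vectorized version takes less than 0.1% of the time to run. Here, this is probably because each call to c to add a new entry has to copy the full vector built so far. A slight change can speed up the for loop a bit, but does not get it all the way to the vectorized speed: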

myForLoopAlt <- function(){
    numrows = nrow(yrdf)
    diff.vec = numeric(numrows) # preallocated vector of differences
    for (index in 1:nrow(yrdf)) { # yrdf is a data frame
        if (index == numrows) {
            diff = NA # because there is no entry "below" it
        } else {
            val_index = yrdf$Adj.Close[index]
            val_next = yrdf$Adj.Close[index+1]
            diff = val_index - val_next # diff between two adjacent values
            diff = diff/yrdf$Adj.Close[index+1] * 100.0
        }
        diff.vec[index] <- diff # assign in place instead of appending
    }
    return(diff.vec)
}



microbenchmark::microbenchmark(
    forLoop = myForLoop() 
    , newLoop = myForLoopAlt() 
    , vector = 100 * c(diff(yrdf$Adj.Close),NA)/c(yrdf$Adj.Close[2:nrow(yrdf)], NA) 
) 

Unit: microseconds
    expr        min         lq        mean      median          uq        max neval
 forLoop 304751.250 315433.802 354605.5850 325944.9075 368584.2065 528732.259   100
 newLoop 168014.142 179579.984 186882.7679 181843.7465 188654.5325 318431.949   100
  vector    169.569    208.193    331.2579    219.9125    233.3115   2956.646   100

This shaves almost half the time off the for-loop approach, but it is still far slower than the vectorized solution.

+0

Wow! This is wonderful. Thank you for the insight. – whirlaway

+0

@whirlaway - Does that answer your question? –

+0

Yes, it does. And more. Thank you very much. – whirlaway

0
yrdf <- data.frame(Adj.Close = rnorm(100)) 
numrows <- length(yrdf$Adj.Close) 
diff.vec <- c((yrdf$Adj.Close[1:(numrows-1)]/yrdf$Adj.Close[2:numrows] - 1) * 100, NA) 
0

You can also use the lead function from the dplyr package to get the result you want.

library(dplyr) 
yrdf <- data.frame(Adj.Close = rnorm(100)) 
(yrdf$Adj.Close/lead(yrdf$Adj.Close)-1)*100 

The calculation has been simplified from (a-b)/b to a/b-1. This is a vectorized operation rather than a for loop.
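As a quick check of that simplification, here is a minimal sketch in base R. It assumes c(x[-1], NA) as a stand-in for lead(x) so that dplyr is not required; the seed and the names x, x_next, ratio_form, and difference_form are hypothetical.

```r
set.seed(1)  # arbitrary seed, assumed for reproducibility
x <- rnorm(100)
n <- length(x)

# Base-R stand-in for dplyr::lead(x): shift left by one, pad the end with NA
x_next <- c(x[-1], NA)

ratio_form      <- (x / x_next - 1) * 100       # a/b - 1
difference_form <- (x - x_next) / x_next * 100  # (a - b)/b

# Both forms should agree within floating-point tolerance; the last entry is NA
all.equal(ratio_form, difference_form)
```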