2017-09-13 140 views
1

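dplyr group_by and iterative loop calculation

I am trying to perform an iterative calculation on grouped data, where each value depends on the two previous elements within the group. As a toy example: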

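library(data.table)

# Note: sample() with a fixed seed returns different values in R >= 3.6.0,
# so the Index column printed below may not be reproduced exactly.
set.seed(100)
df = data.table(ID = c(rep("A_index1", 9)),
                Year = c(2001:2005, 2001:2004),
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))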

R> df
         ID Year Price Index
1: A_index1 2001    NA   1.5
2: A_index1 2002    NA   1.5
3: A_index1 2003    10   2.0
4: A_index1 2004    NA   1.0
5: A_index1 2005    NA   2.0
6: A_index1 2001    15   2.0
7: A_index1 2002    NA   3.0
8: A_index1 2003    13   1.5
9: A_index1 2004    NA   2.0

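The goal is to fill in the missing prices using the last available price, adjusted by the index. I have a loop that performs these calculations, and I am trying to vectorize it using dplyr.

My logic is defined in the loop below: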

df$Price_adj = df$Price
for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
        df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i]/df$Index[i-1], 2)
    }
}

R> df
         ID Year Price Index Price_adj
1: A_index1 2001    NA   1.5        NA
2: A_index1 2002    NA   1.5        NA
3: A_index1 2003    10   2.0     10.00
4: A_index1 2004    NA   1.0      5.00
5: A_index1 2005    NA   2.0     10.00
6: A_index1 2001    15   2.0     15.00
7: A_index1 2002    NA   3.0     22.50
8: A_index1 2003    13   1.5     13.00
9: A_index1 2004    NA   2.0     17.33

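In my actual, much larger data I will have to apply this function to multiple groups, and speed is a consideration. My attempt is below; I need help pointing it in the right direction. I did consider Reduce, but I am not sure how it can incorporate the two previous elements within a group.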

foo = function(Price, Index){
    for (i in 2:nrow(df)) {
        if (is.na(df$Price[i])) {
            df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i]/df$Index[i-1]
        }
    }
}

df %>%
    group_by(ID) %>%
    mutate(Price_adj = Price,
           Price_adj = foo(Price, Index))

Answer

2

One option:

df %>% 
    # group data frame into chunks starting from non na price 
    group_by(ID, g = cumsum(!is.na(Price))) %>% 
    # for each chunk multiply the first non na price with the cumprod of Index[i]/Index[i-1] 
    mutate(Price_adj = round(first(Price) * cumprod(Index/lag(Index, default=first(Index))), 2)) %>% 
    ungroup() %>% select(-g) 

# A tibble: 9 x 5 
#  ID Year Price Index Price_adj 
# <fctr> <int> <dbl> <dbl>  <dbl> 
#1 A_index1 2001 NA 1.5  NA 
#2 A_index1 2002 NA 1.5  NA 
#3 A_index1 2003 10 2.0  10.00 
#4 A_index1 2004 NA 1.0  5.00 
#5 A_index1 2005 NA 2.0  10.00 
#6 A_index1 2001 15 2.0  15.00 
#7 A_index1 2002 NA 3.0  22.50 
#8 A_index1 2003 13 1.5  13.00 
#9 A_index1 2004 NA 2.0  17.33 
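  • Grouping the data frame by ID and cumsum(!is.na(Price)) splits it into chunks, each of which starts with a non-NA price;

  • first(Price) * cumprod(Index/lag(Index, default=first(Index))) does the iterative calculation. It is equivalent to the formula given in the question if you keep replacing Price_adj[i-1] with its own expression in terms of Price_adj[i-2], until you reach Price_adj[1], which is first(Price).

Caveat: this may not be very efficient if you have many NA chunks.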
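To make the two bullets above concrete, here is a minimal base-R sketch on the toy vectors from the question. The variable names (g, idx, ratios, telescopes) are only for illustration and are not part of the original answer.

```r
# Toy vectors copied from the question's example
Price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)
Index <- c(1.5, 1.5, 2.0, 1.0, 2.0, 2.0, 3.0, 1.5, 2.0)

# Chunk id: increases by one at every non-NA price, so each chunk
# (apart from a possible leading all-NA chunk 0) starts with a known price
g <- cumsum(!is.na(Price))
g
# 0 0 1 1 1 2 2 3 3

# Inside one chunk, the cumulative product of successive index ratios
# telescopes to Index[i] / Index[1]
idx    <- Index[g == 1]                      # 2.0 1.0 2.0
ratios <- idx / c(idx[1], head(idx, -1))     # same as Index / lag(Index, default = first(Index))
telescopes <- isTRUE(all.equal(cumprod(ratios), idx / idx[1]))
telescopes
```

Because the product telescopes, each adjusted price in a chunk only depends on the chunk's first price and first index value.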


If speed is the primary concern, you could write your function using the Rcpp package:

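library(Rcpp)
cppFunction("
    NumericVector price_adj(NumericVector price, NumericVector index) {
        int n = price.size();
        NumericVector adjusted_price(n);
        // Nothing to do for an empty group
        if (n == 0) return adjusted_price;
        // The first element has no predecessor, so keep it as is (may be NA)
        adjusted_price[0] = price[0];
        for (int i = 1; i < n; i++) {
            if (NumericVector::is_na(price[i])) {
                // Missing price: scale the previous adjusted price by the index ratio
                adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
            } else {
                // Observed price: use it unchanged
                adjusted_price[i] = price[i];
            }
        }
        return adjusted_price;
    }")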

Now use the cpp function with dplyr as follows:

cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2)) 

cpp_fun() 
# A tibble: 9 x 5 
# Groups: ID [1] 
#  ID Year Price Index Price_adj 
# <fctr> <int> <dbl> <dbl>  <dbl> 
#1 A_index1 2001 NA 1.5  NA 
#2 A_index1 2002 NA 1.5  NA 
#3 A_index1 2003 10 2.0  10.00 
#4 A_index1 2004 NA 1.0  5.00 
#5 A_index1 2005 NA 2.0  10.00 
#6 A_index1 2001 15 2.0  15.00 
#7 A_index1 2002 NA 3.0  22.50 
#8 A_index1 2003 13 1.5  13.00 
#9 A_index1 2004 NA 2.0  17.33 
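
If compiling C++ is not an option, the same loop can be sketched as a plain R function that takes the two columns as vectors and returns a vector, which is what mutate() expects. The name price_adj_r and the two-group toy data frame are hypothetical and not part of the original answer; expect this to be slower than the Rcpp version on large groups.

```r
library(dplyr)

# Hypothetical pure-R counterpart of the C++ price_adj(): works on vectors
# and returns the adjusted vector, so it can be called inside mutate()
price_adj_r <- function(price, index) {
  adjusted <- price
  # seq_along()[-1] is empty for vectors of length 0 or 1, unlike 2:n
  for (i in seq_along(price)[-1]) {
    if (is.na(price[i])) {
      adjusted[i] <- adjusted[i - 1] * index[i] / index[i - 1]
    }
  }
  adjusted
}

# Toy data with two groups, using the price and index values from the question
toy <- data.frame(
  ID    = rep(c("A", "B"), times = c(5, 4)),
  Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
  Index = c(1.5, 1.5, 2.0, 1.0, 2.0, 2.0, 3.0, 1.5, 2.0)
)

res <- toy %>%
  group_by(ID) %>%
  mutate(Price_adj = round(price_adj_r(Price, Index), 2)) %>%
  ungroup()

res$Price_adj
```

The function never touches the global df, which is the main thing the attempt in the question was missing: inside group_by() and mutate(), the function only sees the current group's columns.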

Benchmark

Define r_fun as:

r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index/lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g) 

Even on the small sample data there is already a difference:

microbenchmark::microbenchmark(r_fun(), cpp_fun()) 
#Unit: milliseconds 
#  expr  min  lq  mean median  uq  max neval 
# r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975 100 
# cpp_fun() 3.191278 3.308758 3.738809 3.491495 3.937006 6.627019 100 

Testing on a slightly larger data frame:

df <- bind_rows(rep(list(df), 10000)) 
#dim(df) 
#[1] 90000  4 

microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10) 
#Unit: milliseconds 
#  expr  min   lq  mean median  uq  max neval 
# r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576 10 
# cpp_fun() 8.722794 8.888667 10.67781 10.86399 12.10647 13.68302 10 

Identity test

identical(ungroup(r_fun()), ungroup(cpp_fun())) 
# [1] TRUE 
+0

Could you add more explanation of the mutate step? – Divi

+0

Updated with some explanation... – Psidom

+1

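Given the way the formula is set up, the solution can easily be used as below, without 'cumprod', to make it more efficient. Most likely there is no need for 'Rcpp'. Thanks. 'mutate(Price_adj = round(first(Price) * Index/first(Index), 2))' – Divi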
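
For reference, the simplification from the last comment can be dropped into the grouped pipeline roughly as follows. This is a sketch on a toy data frame built from the question's values; the names toy and simplified are made up for the example.

```r
library(dplyr)

# Toy data using the price and index values from the question
toy <- data.frame(
  ID    = "A_index1",
  Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
  Index = c(1.5, 1.5, 2.0, 1.0, 2.0, 2.0, 3.0, 1.5, 2.0)
)

simplified <- toy %>%
  # same chunking as in the answer: a new chunk starts at each non-NA price
  group_by(ID, g = cumsum(!is.na(Price))) %>%
  # the ratio product telescopes, so only the chunk's first price and index are needed
  mutate(Price_adj = round(first(Price) * Index / first(Index), 2)) %>%
  ungroup() %>%
  select(-g)

simplified$Price_adj
```

One difference from the loop in the question: the loop rounds at every step, while this version rounds once at the end, so results could differ in the last decimal on longer runs of missing prices. On the toy data the two agree.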