Per-row data.table function is too slow

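I need to compute a weighted mean for each row (6M+ rows), but it takes a very long time. The column holding the weights is a character field, so weighted.mean cannot be applied to it directly.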

Background data:

library(data.table) 
library(stringr) 
values <- c(1,2,3,4) 
grp <- c("a", "a", "b", "b") 
weights <- c("{10,0,0,0}", "{0,10,0,0}", "{10,10,0,0}", "{0,0,10,0}") 
DF <- data.frame(grp, weights, stringsAsFactors = FALSE)
DT <- data.table(DF) 

# Parse a weight string such as "{10,0,0,0}" and return the weighted mean of `values`
string.weighted.mean <- function(weights.x) {
    w <- na.omit(as.numeric(unlist(str_split(string=weights.x, pattern="[^0-9]+"))))
    weighted.mean(x=values, w=w)
}

Here is how it can be done (too slowly) with a data.frame:

DF$wm <- mapply(string.weighted.mean, DF$weights)

This does the job with a data.table, but it is too slow (hours):

DT[, wm:=mapply(string.weighted.mean, weights)]

How can I modify the last line to speed it up?

Source: 2013-01-23, Chris

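You already have a good answer. Just to add: I can hardly think of a worse input format. If possible, use a list to store the weights as numeric vectors. For efficiency, never iterate by row; always iterate by column. A matrix may be better suited than a data.table to a task like this.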
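As a rough illustration of the column-wise, matrix-based idea in this comment, the sketch below parses all weight strings in one vectorised call and computes every weighted mean with a single matrix product. It is not taken from any answer in this thread. It assumes every string holds exactly as many weights as there are entries in `values`, and the names `w_cols` and `W` are made up for the example.

```r
library(data.table)

values <- c(1, 2, 3, 4)
DT <- data.table(
  grp     = c("a", "a", "b", "b"),
  weights = c("{10,0,0,0}", "{0,10,0,0}", "{10,10,0,0}", "{0,0,10,0}")
)

# Strip the braces and split all strings in one vectorised call;
# tstrsplit() returns a list with one element per weight position.
w_cols <- tstrsplit(gsub("[{}]", "", DT$weights), ",", fixed = TRUE)

# Assumption: each string has the same number of weights as length(values)
W <- do.call(cbind, lapply(w_cols, as.numeric))   # n x 4 numeric matrix

# Row-wise weighted means: (W %*% values) / rowSums(W)
DT[, wm := as.vector(W %*% values) / rowSums(W)]
```

For the four example rows this gives 1, 2, 1.5 and 3. A row whose weights are all zero would produce NaN, as it does with weighted.mean.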

DT[, rowid := 1:nrow(DT)] 
setkey(DT, rowid) 
DT[, wm :={ 
    weighted.mean(x=values, w=na.omit(as.numeric(unlist(str_split(string=weights, pattern="[^0-9]+")))))  
}, by=rowid]

Source: 2013-01-23 01:23:59, Michael

A good way to create 'rowid' is 'rowid := .I'.
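A minimal sketch of that suggestion, using a made-up two-row table: inside `DT[...]`, the special symbol `.I` holds the row numbers, so no explicit `1:nrow(DT)` is needed.

```r
library(data.table)

# Hypothetical two-row table, only for illustration
DT <- data.table(weights = c("{10,0,0,0}", "{0,10,0,0}"))

# .I is the integer vector of row numbers of DT
DT[, rowid := .I]
setkey(DT, rowid)
```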

Since the group does not appear to play any part in computing the weighted mean, I tried to simplify the problem.

values <- seq(4)

# Build one weight string of four random weights, each 0 or 10, e.g. "{10,0,0,10}"
tstwts <- function()
{
    w <- sample(c(0, 10), 4, replace = TRUE)
    paste0("{", paste(w, collapse = ","), "}")
}

# Generate 100K strings and put them into a vector
u <- replicate(1e5, tstwts())
head(u)     # Check
table(u)

# Compute a weighted mean from a weight string, using the external numeric
# vector 'values', assumed to have the same length as the weights
f <- function(x)
{
    valstr <- gsub("[\\{\\}]", "", x)
    wts <- as.numeric(unlist(strsplit(valstr, ",")))
    sum(wts * values)/sum(wts)
}

# Apply f to each element of the weight vector u
v <- sapply(u, f)

# Some checks:
head(v)
table(v)

On my system, for 100K replicates:

> system.time(sapply(u, f)) 
    user system elapsed 
    3.79 0.00 3.83

The data.table version of this (sans groups) would be

DT <- data.table(weights = u) 
DT[, wt.mean := lapply(weights, f)]   # lapply() yields a list column; sapply() would give a numeric one
head(DT) 
dim(DT)

On my system, this takes

> system.time(DT[, wt.mean := lapply(weights, f)]) 
    user system elapsed 
    3.62 0.03 3.69

So expect roughly 35-40 seconds per million observations on a system comparable to mine (Win7, 2.8 GHz dual-core chip, 8 GB RAM). Your mileage may vary.
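One further idea, not part of the answer itself: in the simulated data there are at most 16 distinct weight strings, so the parsing can be done once per distinct string and the results mapped back to every row. The sketch below assumes that the real data also repeats a fairly small set of strings, which may not hold. The names `uu`, `lookup` and `v_fast` are invented for the example, and the vector is kept short so it runs quickly.

```r
set.seed(1)          # assumption: fixed seed, only to make the sketch reproducible
values <- seq(4)

# Simulated weight strings in the same format as above (10K instead of 100K)
u <- replicate(1e4, paste0("{", paste(sample(c(0, 10), 4, replace = TRUE),
                                      collapse = ","), "}"))

# Same computation as f above, for a single string
f <- function(x) {
  wts <- as.numeric(strsplit(gsub("[{}]", "", x), ",", fixed = TRUE)[[1]])
  sum(wts * values) / sum(wts)
}

uu     <- unique(u)                                   # distinct strings only
lookup <- vapply(uu, f, numeric(1), USE.NAMES = FALSE)
v_fast <- lookup[match(u, uu)]                        # map results back to every row
```

Here f is called once per distinct string instead of once per row, so the cost of the parsing step scales with the number of distinct strings. If nearly every string is unique, this gains nothing over `sapply(u, f)`.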

Source: 2013-01-23 05:17:37, Dennis
