data.tables and the sweep function

Using data.table, what would be the fastest way to "sweep out" a statistic across a selection of columns?

Starting with (considerably larger versions of) DT:

library(data.table)
p <- 3
DT <- data.table(id=c("A","B","C"),x1=c(10,20,30),x2=c(20,30,10))
DT.totals <- DT[, list(id,total = x1+x2) ]

I'd like to get the following data.table result by indexing the target columns (2:p), so as to skip the key:

id x1 x2 
[1,] A 0.33 0.67 
[2,] B 0.40 0.60 
[3,] C 0.75 0.25
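
For reference, the same "sweep" on a plain numeric matrix can be sketched with base R's sweep(). This is only a baseline to make the target result concrete, not a data.table solution; the matrix m and the name props are made up for the illustration.

```r
# Base-R baseline: divide every row of a numeric matrix by its row total.
m <- cbind(x1 = c(10, 20, 30), x2 = c(20, 30, 10))

# MARGIN = 1 sweeps the statistic (here the row sums) out across rows.
props <- sweep(m, 1, rowSums(m), "/")

# Rounded to two places, this should match the table above.
round(props, 2)
```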

Source

2012-04-11 M.Dimo

I believe that something close to the following (which uses the relatively new set() function) will be the quickest:

DT <- data.table(id = c("A","B","C"), x1 = c(10,20,30), x2 = c(20,30,10)) 
total <- DT[ , x1 + x2] 

rr <- seq_len(nrow(DT)) 
for(j in 2:3) set(DT, rr, j, DT[[j]]/total) 
DT 
#  id  x1  x2 
# [1,] A 0.3333333 0.6666667 
# [2,] B 0.4000000 0.6000000 
# [3,] C 0.7500000 0.2500000

FWIW, calls to set() take the following form:

# set(x, i, j, value), where: 
#  x is a data.table 
#  i contains row indices 
#  j contains column indices 
#  value is the value to be assigned into the specified cells
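
As a small illustration of that call form, here is a sketch that assigns into a single cell and then into a whole column. It assumes a reasonably recent data.table, where leaving i as NULL means "all rows"; the values 99 and the divisor 10 are arbitrary.

```r
library(data.table)

DT <- data.table(id = c("A", "B", "C"), x1 = c(10, 20, 30), x2 = c(20, 30, 10))

# One cell: row 2 of column 2 (x1), modified by reference.
set(DT, i = 2L, j = 2L, value = 99)

# Whole column: i = NULL addresses every row of column 3 (x2).
set(DT, i = NULL, j = 3L, value = DT[[3L]] / 10)

DT
```

Because set() works by reference, DT itself changes and nothing needs to be assigned back.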

I'm basing my suspicion about the speed of this, relative to other solutions, on this passage from data.table's NEWS file, in the section on changes in version 1.8.0:

o New function set(DT,i,j,value) allows fast assignment to elements 
    of DT. Similar to := but avoids the overhead of [.data.table, so is 
    much faster inside a loop. Less flexible than :=, but as flexible 
    as matrix subassignment. Similar in spirit to setnames(), setcolorder(), 
    setkey() and setattr(); i.e., assigns by reference with no copy at all. 

     M = matrix(1,nrow=100000,ncol=100) 
     DF = as.data.frame(M) 
     DT = as.data.table(M) 
     system.time(for (i in 1:1000) DF[i,1L] <- i) # 591.000s 
     system.time(for (i in 1:1000) DT[i,V1:=i])  # 1.158s 
     system.time(for (i in 1:1000) M[i,1L] <- i) # 0.016s 
     system.time(for (i in 1:1000) set(DT,i,1L,i)) # 0.027s
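
To check this on your own machine, a scaled-down version of that comparison might look like the sketch below. The table size is reduced so it runs quickly, and the names DT1, DT2, t_assign and t_set are invented here; the timings will differ from those quoted in NEWS and depend on hardware and data.table version.

```r
library(data.table)

# Two identical tables, so each loop modifies its own copy.
M   <- matrix(1, nrow = 1000, ncol = 10)
DT1 <- as.data.table(M)
DT2 <- as.data.table(M)

# Row-by-row assignment through [.data.table with := ...
t_assign <- system.time(for (i in 1:1000) DT1[i, V1 := i])

# ... versus the same assignment through set().
t_set <- system.time(for (i in 1:1000) set(DT2, i, 1L, i))

# Compare elapsed times; both tables should end up with the same V1.
rbind(assign = t_assign, set = t_set)
```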

Source

2012-04-11 17:54:50

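Thanks for the answer. I've upgraded to data.table 1.8.0 and ran the test code above successfully. When the numerator and denominator are both integer columns from data.tables, I do get a verbose warning (it doesn't fit here) about coercing to double. I'll edit the question.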

I'm having trouble editing today: no line breaks. Anyway, here's the code: for (j in 2:p) { set(dt, allrows, j, dt[[j]]/denom[[2]]) } For dt and denom, columns 2 through p are integers. The warning I get is:

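"Warning message: In set(dt, allrows, j, dt[[j]]/denom[[2]]) : Coerced 'double' RHS to 'integer' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 16863 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please."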
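
One way to avoid this warning, along the lines it suggests, is to replace each integer column outright rather than assigning into its existing rows. The sketch below assumes a recent data.table, where calling set() without i replaces the whole column and lets its type change; here denom is a plain numeric vector invented for the example, not the commenter's denom table.

```r
library(data.table)

# Integer columns, as in the situation described in the comment.
dt <- data.table(id = c("A", "B", "C"),
                 x1 = c(10L, 20L, 30L),
                 x2 = c(20L, 30L, 10L))

# Row totals, computed once before any column is overwritten.
denom <- dt[["x1"]] + dt[["x2"]]

# Omitting i replaces each column wholesale, so the double result
# is stored as-is instead of being coerced back to integer.
for (j in 2:3) set(dt, j = j, value = dt[[j]] / denom)

dt
```

Creating the columns as double in the first place, as the last sentence of the warning recommends, avoids the issue altogether.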
