2017-08-31

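R - Updating columns in a very large sparse matrix

I need to update certain columns in a sparse matrix, but the operation takes so long that it never finishes.

I have a sparse matrix with just under 3M rows and about 1,500 columns. I also have a data frame with the same number of rows, but only 10 columns. I want to update certain column indices of the matrix with the values from the data.frame.

I have no problem doing this with a normal matrix, but with a sparse matrix even a single column takes too long to complete.

Below is the code I am using. What needs to change for it to run efficiently?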

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

var_nums <- sample(1:1559,size = 5) 

for (i in 1:5){ 
    x[,var_nums[i]] <- df[,i] 
} 

Answers

1

I was able to get it to finish in under 1 second by eliminating the for loop and using the Matrix::cBind function instead.

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

var_nums <- sample(1:1559,size = 5) 

t <- Sys.time() 
x   <- x[,-var_nums] 
x   <- Matrix::cBind(x, Matrix::as.matrix(df)) 
Sys.time()-t 
Time difference of 0.541054 secs 
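
Two details of the snippet above are worth spelling out. It appends the replacement columns at the right-hand end of the matrix rather than at their original positions, and `cBind` has been deprecated in later releases of Matrix in favour of plain `cbind()`. The following is a minimal sketch of the same drop-and-append idea using `cbind()`, on deliberately small dimensions. The sizes and the names `new_cols` and `x2` are illustrative assumptions, not part of the answer.

```r
library(Matrix)

set.seed(1)
nr <- 1000; nc <- 20; n <- 5            # small illustrative sizes (assumption)
x  <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE)
df <- data.frame(replicate(n, sample(0:1, nr, replace = TRUE)))
var_nums <- sample(1:nc, size = n)

# Convert the data frame once and keep it sparse
new_cols <- Matrix(as.matrix(df), sparse = TRUE)

# Drop the target columns, then append the replacements on the right.
# With R >= 3.2.0, base cbind() dispatches to the Matrix methods.
x2 <- cbind(x[, -var_nums], new_cols)

dim(x2)   # same dimensions as x, but the new columns sit at the end
```

If the column positions matter, the result still has to be reordered, which is what the variant with preserved order addresses.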

WITH ORDER PRESERVED (still under 1 second!)

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

colnames(x) <- paste("col", 1:ncol(x)) 
col.order <- colnames(x) 

cols <- sample(colnames(x),size = 5) 
colnames(df) <- cols 

t <- Sys.time() 
x   <- x[,-which(colnames(x) %in% cols)] 
x   <- Matrix::cBind(x, Matrix::as.matrix(df)) 
x   <- x[,col.order] 
Sys.time()-t 
>  Time difference of 0.550012 secs 

# Proof that order is preserved: 
identical(colnames(x), col.order) 

TRUE
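
The name-based reordering above relies on unique column names. As a hedged alternative, the same drop, append, and reorder steps can be done purely with integer positions. The helper `replace_columns()` below is a hypothetical name introduced for illustration, and the small test data is an assumption; it is a sketch, not the answerer's code.

```r
library(Matrix)

# Hypothetical helper: replace the columns of x at positions var_nums with the
# columns of df, keeping every column at its original position.
replace_columns <- function(x, df, var_nums) {
  stopifnot(ncol(df) == length(var_nums),
            nrow(df) == nrow(x),
            !anyDuplicated(var_nums))
  kept     <- setdiff(seq_len(ncol(x)), var_nums)       # untouched positions
  new_cols <- Matrix(unname(as.matrix(df)), sparse = TRUE)
  combined <- cbind(x[, kept, drop = FALSE], new_cols)  # new columns at the end
  # Column p of 'combined' belongs at position c(kept, var_nums)[p];
  # order() inverts that permutation.
  combined[, order(c(kept, var_nums)), drop = FALSE]
}

# Small illustrative data (assumption)
set.seed(42)
x <- Matrix(0, nrow = 20, ncol = 8, sparse = TRUE)
x[3, 2] <- 7                                  # some pre-existing content
df <- data.frame(replicate(3, sample(0:1, 20, replace = TRUE)))
var_nums <- c(6, 1, 4)

res <- replace_columns(x, df, var_nums)
```

Because the permutation is computed from positions, this variant also works on matrices without column names.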

1

You can use the i, j, x (triplet) notation of sparseMatrix:

library(Matrix) 

# data 
set.seed(1) 
# Changed the dim size to fit in my laptop memory 
nc=10 
nr=100 
n=5 

df <- data.frame(replicate(n,sample(0:1,nr,rep = TRUE))) 
var_nums <- sample(1:nc,size = n) 

#Yours  
x <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE) 
for (i in 1:n){ 
    x[,var_nums[i]] <- df[,i] 
} 

# new version 
i = ((which(df==1)-1) %% nr) +1 
j = rep(var_nums, times=colSums(df)) 
y = sparseMatrix(i=i, j=j, x=1, dims=c(nrow(df), nc)) 

all.equal(x, y, check.attributes=FALSE) 
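
The index arithmetic above assumes that `df` holds only 0/1 values and that the other columns of the matrix are empty, as in the example. A possibly more readable sketch of the same triplet construction uses `which(..., arr.ind = TRUE)` to obtain row and column positions directly, and carries the actual non-zero values along. The 0:3 sample data and the names `m` and `nz` are assumptions made for illustration.

```r
library(Matrix)

set.seed(1)
nr <- 100; nc <- 10; n <- 5
# Illustrative data with values other than 0/1 (assumption)
df <- data.frame(replicate(n, sample(0:3, nr, replace = TRUE)))
var_nums <- sample(1:nc, size = n)

m  <- as.matrix(df)
nz <- which(m != 0, arr.ind = TRUE)      # two columns: "row" and "col"

y <- sparseMatrix(
  i    = nz[, "row"],                    # row of each non-zero entry
  j    = var_nums[nz[, "col"]],          # map df column -> target matrix column
  x    = as.numeric(m[nz]),              # the non-zero values themselves
  dims = c(nr, nc)
)
```

Like the version above, this builds a fresh matrix rather than modifying an existing one, so any content already stored in other columns would have to be merged in separately.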

Comparing speed

f1 <- function(){  
    for (i in 1:n){ 
     x[,var_nums[i]] <- df[,i] 
    } 
    x 
} 

f2 <- function(){ 
    i = ((which(df==1)-1) %% nr) +1 
    j = rep(var_nums, times=colSums(df)) 
    y = sparseMatrix(i=i, j=j, x=1, dims=c(nrow(df), nc)) 
    y 
} 

microbenchmark::microbenchmark(f1(), f2()) 

Unit: milliseconds
 expr      min       lq     mean   median       uq       max neval cld
 f1() 4.594229 4.694205 5.010071 4.770475 4.891649 12.666554   100   b
 f2() 1.274745 1.298663 1.464237 1.329534 1.392146  7.153076   100  a

Trying a larger size

nc=100 
nr=10000 
n=50 
set.seed(1) 
df <- data.frame(replicate(n,sample(0:1,nr,rep = TRUE))) 
var_nums <- sample(1:nc,size = n) 
x <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE) 

all.equal(f1(), f2(), check.attributes=FALSE) 

microbenchmark::microbenchmark(f1(), f2(), times=1) 
Unit: milliseconds
 expr         min          lq        mean      median          uq         max neval
 f1() 21605.60251 21605.60251 21605.60251 21605.60251 21605.60251 21605.60251     1
 f2()    60.87275    60.87275    60.87275    60.87275    60.87275    60.87275     1

0

This is a bit tedious, but you can bind the required columns together like this:

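# Assumes var_nums is sorted in increasing order, with no two indices adjacent
# and none equal to 1 or Nc; otherwise the ranges below run backwards.
Nc = NCOL(x)

Matrix(cbind(
    x[, 1:(var_nums[1]-1)],
    df[, 1],
    x[, (var_nums[1]+1):(var_nums[2]-1)],
    df[, 2],
    x[, (var_nums[2]+1):(var_nums[3]-1)],
    df[, 3],
    x[, (var_nums[3]+1):(var_nums[4]-1)],
    df[, 4],
    x[, (var_nums[4]+1):(var_nums[5]-1)],
    df[, 5],
    x[, (var_nums[5]+1):Nc]),
    sparse = TRUE)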

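With only 5 columns of df to insert, this isn't too bad. If df had more columns, or a varying number of columns, a different syntax would probably be more appropriate. Either way, binding columns is relatively fast.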
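
One possible shape for that "different syntax" is to build the list of pieces in a loop and bind them in a single `do.call(cbind, ...)`. The sketch below does this for any number of replacement columns, and also handles unsorted or adjacent indices and targets at the first or last column. The helper name `interleave_columns()` and the test data are hypothetical, introduced only for illustration.

```r
library(Matrix)

# Hypothetical helper: rebuild x with df's columns placed at positions var_nums.
interleave_columns <- function(x, df, var_nums) {
  stopifnot(ncol(df) == length(var_nums), !anyDuplicated(var_nums))
  ord      <- order(var_nums)                 # walk the targets left to right
  var_nums <- var_nums[ord]
  df       <- df[, ord, drop = FALSE]
  bounds   <- c(0L, var_nums, ncol(x) + 1L)   # sentinels on both ends
  pieces   <- list()
  for (k in seq_len(length(bounds) - 1L)) {
    # untouched columns strictly between two consecutive targets
    gap <- seq_len(bounds[k + 1L] - bounds[k] - 1L) + bounds[k]
    if (length(gap) > 0L) {
      pieces[[length(pieces) + 1L]] <- x[, gap, drop = FALSE]
    }
    # after every gap except the trailing one, insert the next df column
    if (k <= length(var_nums)) {
      pieces[[length(pieces) + 1L]] <- Matrix(df[[k]], ncol = 1, sparse = TRUE)
    }
  }
  do.call(cbind, pieces)
}

# Small illustrative data (assumption)
set.seed(1)
x  <- Matrix(0, nrow = 50, ncol = 8, sparse = TRUE)
df <- data.frame(replicate(3, sample(0:1, 50, replace = TRUE)))
var_nums <- c(6, 1, 4)      # unsorted, and touching the first column

res <- interleave_columns(x, df, var_nums)
```

Converting each df column to a one-column sparse Matrix before binding keeps every piece sparse, so the bound result should not need a final conversion.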