R - Updating columns in a very large sparse matrix

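I need to update certain columns in a sparse matrix, but the operation takes so long that it never finishes.

I have a sparse matrix with just under 3M rows and roughly 1500 columns. I also have a data frame with the same number of rows but only 10 columns. I want to overwrite certain column indices of the matrix with the values from the data.frame.

I have no trouble doing this with a normal matrix, but with a sparse matrix it takes forever, even for a single column.

Below is the code I am using. What needs to change for it to run efficiently?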

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

var_nums <- sample(1:1559,size = 5) 

for (i in 1:5){ 
    x[,var_nums[i]] <- df[,i] 
}

Source

2017-08-31 Nate Thompson

I was able to get it to finish in under 1 second using the Matrix::cBind function, by eliminating the for loop.

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

var_nums <- sample(1:1559,size = 5) 

t <- Sys.time() 
x   <- x[,-var_nums] 
x   <- Matrix::cBind(x, Matrix::as.matrix(df)) 
Sys.time()-t

Time difference of 0.541054 secs

WITH ORDER PRESERVED (still under 1 second!)

library(Matrix) 

x <- Matrix(0, nrow = 2678748, ncol = 1559, sparse = TRUE) 
df <- data.frame(replicate(5,sample(0:1,2678748,rep = TRUE))) 

colnames(x) <- paste("col", 1:ncol(x)) 
col.order <- colnames(x) 

cols <- sample(colnames(x),size = 5) 
colnames(df) <- cols 

t <- Sys.time() 
x   <- x[,-which(colnames(x) %in% cols)] 
x   <- Matrix::cBind(x, Matrix::as.matrix(df)) 
x   <- x[,col.order] 
Sys.time()-t 
>  Time difference of 0.550012 secs 

# Proof that order is preserved: 
identical(colnames(x), col.order)

TRUE
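As far as I know, more recent releases of Matrix deprecate cBind() in favor of base cbind(), which dispatches to the sparse method when given a Matrix object. The following is a minimal sketch of the same drop, bind, and reorder idea using cbind(), on small hypothetical dimensions (nr, nc, and n are made up so the result can be checked quickly). Converting df to a sparse matrix first is my own assumption about how to keep the result sparse; it is not part of the original answer.

```r
library(Matrix)

set.seed(42)
nr <- 1000; nc <- 20; n <- 5   # small, hypothetical sizes

x <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE)
colnames(x) <- paste("col", 1:nc)
col.order <- colnames(x)

df <- data.frame(replicate(n, sample(0:1, nr, replace = TRUE)))
colnames(df) <- sample(col.order, size = n)

# Convert the replacement columns to a sparse matrix (assumption: this
# keeps the result of cbind() sparse)
new_cols <- Matrix(as.matrix(df), sparse = TRUE)

x <- x[, !(colnames(x) %in% colnames(df))]  # drop the columns to be replaced
x <- cbind(x, new_cols)                     # append the new columns at the end
x <- x[, col.order]                         # restore the original column order
```

Beyond checking the column names, it is worth comparing x[, colnames(df)] against df to confirm that the values ended up under the right names.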

Source

2017-08-31 19:06:59

宥

You can use the i, j, x notation of sparseMatrix:

library(Matrix) 

# data 
set.seed(1) 
# Changed the dim size to fit in my laptop memory 
nc=10 
nr=100 
n=5 

df <- data.frame(replicate(n,sample(0:1,nr,rep = TRUE))) 
var_nums <- sample(1:nc,size = n) 

#Yours  
x <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE) 
for (i in 1:n){ 
    x[,var_nums[i]] <- df[,i] 
} 

# new version 
i = ((which(df==1)-1) %% nr) +1 
j = rep(var_nums, times=colSums(df)) 
y = sparseMatrix(i=i, j=j, x=1, dims=c(nrow(df), nc)) 

all.equal(x, y, check.attributes=FALSE)

Comparing speed:

f1 <- function(){  
    for (i in 1:n){ 
     x[,var_nums[i]] <- df[,i] 
    } 
    x 
} 

f2 <- function(){ 
    i = ((which(df==1)-1) %% nr) +1 
    j = rep(var_nums, times=colSums(df)) 
    y = sparseMatrix(i=i, j=j, x=1, dims=c(nrow(df), nc)) 
    y 
} 

microbenchmark::microbenchmark(f1(), f2()) 

Unit: milliseconds
 expr      min       lq     mean   median       uq       max neval cld
 f1() 4.594229 4.694205 5.010071 4.770475 4.891649 12.666554   100   b
 f2() 1.274745 1.298663 1.464237 1.329534 1.392146  7.153076   100  a

Trying something larger:

nc=100 
nr=10000 
n=50 
set.seed(1) 
df <- data.frame(replicate(n,sample(0:1,nr,rep = TRUE))) 
var_nums <- sample(1:nc,size = n) 
x <- Matrix(0, nrow = nr, ncol = nc, sparse = TRUE) 

all.equal(f1(), f2(), check.attributes=FALSE) 

microbenchmark::microbenchmark(f1(), f2(), times=1) 
Unit: milliseconds
 expr         min          lq        mean      median          uq         max neval
 f1() 21605.60251 21605.60251 21605.60251 21605.60251 21605.60251 21605.60251     1
 f2()    60.87275    60.87275    60.87275    60.87275    60.87275    60.87275     1
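Note that f2() builds y from scratch, which matches the question because x starts out all zero, and it relies on df holding only 0/1 values. If x already contained data in other columns, one possible variant is to take the existing (i, j, x) triplets from summary(x), drop those in the columns being replaced, and append the triplets for the new columns. This is only a sketch under those assumptions; the starting matrix from rsparsematrix() is hypothetical test data, not part of the original answer.

```r
library(Matrix)

set.seed(1)
nr <- 100; nc <- 10; n <- 5

# Hypothetical starting matrix that already holds some non-zero entries
x <- rsparsematrix(nr, nc, density = 0.1)
df <- data.frame(replicate(n, sample(0:1, nr, replace = TRUE)))
var_nums <- sample(1:nc, size = n)

# Triplets (i, j, x) of the existing matrix, minus the columns being replaced
old <- summary(x)
old <- old[!(old$j %in% var_nums), ]

# Row/column positions of the non-zero entries in df
m  <- as.matrix(df)
nz <- which(m != 0, arr.ind = TRUE)

# Rebuild the matrix in one call; df column k is mapped to var_nums[k]
y <- sparseMatrix(
  i = c(old$i, nz[, "row"]),
  j = c(old$j, var_nums[nz[, "col"]]),
  x = c(old$x, m[nz]),
  dims = dim(x)
)
```

Because the old entries in the replaced columns are removed before the new ones are appended, no (i, j) pair appears twice, so sparseMatrix() does not sum any duplicates.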

Source

2017-08-31 19:10:17 user20650

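This is slightly cumbersome, but you can bind the needed columns together like this:

Nc = NCOL(x) 

# Assumes var_nums is sorted in increasing order, with no adjacent
# indices and none equal to 1 or Nc
Matrix(cbind(
    x[, 1:(var_nums[1]-1)], 
    df[, 1], 
    x[, (var_nums[1]+1):(var_nums[2]-1)], 
    df[, 2], 
    x[, (var_nums[2]+1):(var_nums[3]-1)], 
    df[, 3], 
    x[, (var_nums[3]+1):(var_nums[4]-1)], 
    df[, 4], 
    x[, (var_nums[4]+1):(var_nums[5]-1)], 
    df[, 5], 
    x[, (var_nums[5]+1):Nc]), 
    sparse = TRUE)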

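With only 5 columns of df to insert, this isn't too bad. If df had more columns, or a varying number of them, a different syntax would probably be more appropriate. Either way, binding columns is relatively fast.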
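For a varying number of columns, one option is to bind all replacement columns at the end and then undo the shuffle with a single column permutation. The helper below is a hypothetical sketch (replace_cols is not a Matrix function), and it assumes var_nums contains distinct, valid column indices in any order.

```r
library(Matrix)

# Hypothetical helper: replace columns `var_nums` of sparse matrix `x`
# with the columns of `df`, for any number of columns in any index order
replace_cols <- function(x, var_nums, df) {
  keep <- setdiff(seq_len(ncol(x)), var_nums)
  new_cols <- Matrix(as.matrix(df), sparse = TRUE)
  z <- cbind(x[, keep, drop = FALSE], new_cols)
  # Column k of z holds original position c(keep, var_nums)[k];
  # order() returns the permutation that restores positions 1..ncol(x)
  z[, order(c(keep, var_nums)), drop = FALSE]
}

set.seed(1)
x <- Matrix(0, nrow = 100, ncol = 10, sparse = TRUE)
df <- data.frame(replicate(5, sample(0:1, 100, replace = TRUE)))
var_nums <- sample(1:10, size = 5)

z <- replace_cols(x, var_nums, df)
```

Since c(keep, var_nums) is a permutation of the column positions, order() gives its inverse, so column var_nums[k] of z ends up holding df[, k] without writing out each slice by hand.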

Source

2017-08-31 20:10:05 dww
