How to combine columns with identical names in a large sparse matrix

I have a sparse dgTMatrix from the Matrix package that has picked up some duplicate colnames. I want to combine them by summing the columns that share a name, producing a reduced matrix.

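I found this post and adapted it to use sparse matrix operations. However, it is still very slow on large objects. I am wondering whether someone has a better solution that operates directly on the indexed elements of the sparse matrix, which should be faster. For instance, A@j indexes (from zero) the labels in A@Dimnames[[2]], which could be compressed and used to reindex A@j. (Note: this is why I used the triplet sparse matrix form rather than the Matrix default of column-compressed sparse matrices, since figuring out the p values makes my head hurt every time.)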

require(Matrix) 

# set up a (triplet) sparseMatrix 
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(1:3, 2), 
        giveCsparse = FALSE, 
        dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2))) 
A 
## 2 x 6 sparse Matrix of class "dgTMatrix" 
##    a b c a b c 
## r1 1 . 3 . 2 . 
## r2 . 2 . 1 . 3 

str(A) 
## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots 
##   ..@ i       : int [1:6] 0 1 0 1 0 1 
##   ..@ j       : int [1:6] 0 1 2 3 4 5 
##   ..@ Dim     : int [1:2] 2 6 
##   ..@ Dimnames:List of 2 
##   .. ..$ : chr [1:2] "r1" "r2" 
##   .. ..$ : chr [1:6] "a" "b" "c" "a" ... 
##   ..@ x       : num [1:6] 1 2 3 1 2 3 
##   ..@ factors : list() 

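# my matrix-based attempt 
OP1 <- function(x) { 
    nms <- colnames(x) 
    # if any names repeat, multiply by an indicator matrix (one row per original 
    # column, one column per unique name) so that same-named columns are summed 
    if (any(duplicated(nms))) 
        x <- x %*% Matrix(sapply(unique(nms), "==", nms)) 
    x 
} 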
OP1(A) 
## 2 x 3 sparse Matrix of class "dgCMatrix" 
##    a b c 
## r1 1 2 3 
## r2 1 2 3
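
To see why the multiplication in OP1() sums same-named columns, it can help to look at the indicator matrix on its own. The following is a minimal sketch using the same toy data as A above; the variable names ind and res are illustrative and not part of the original post.

```r
library(Matrix)

# Same toy data as A above, built with the default column-compressed storage
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(c(1, 2, 3), 2),
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
nms <- colnames(A)

# 6 x 3 logical matrix: ind[k, u] is TRUE when original column k
# carries the u-th unique name
ind <- sapply(unique(nms), "==", nms)
dim(ind)

# Each output column is the sum of the input columns flagged in ind
res <- A %*% Matrix(ind)
res
```

Note that sapply() builds ind as an ordinary dense matrix before Matrix() converts it, which is one place where the cost grows with the number of unique names.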

OP1() works fine, but it seems slow on the huge sparse objects I intend to use it on. Here is a larger example:

# now something bigger, for testing 
set.seed(10) 
nr <- 10000  # rows 
nc <- 26*100 # columns - 100 repetitions of a-z 
nonZeroN <- round(nr * nc/3) # two-thirds sparse 
B <- sparseMatrix(i = sample(1:nr, size = nonZeroN, replace = TRUE), 
        j = sample(1:nc, size = nonZeroN, replace = TRUE), 
        x = round(runif(nonZeroN)*5+1), 
        giveCsparse = FALSE, 
        dimnames = list(paste0("r", 1:nr), rep(letters, nc/26))) 
print(B[1:5, 1:10], col.names = TRUE) 
## 5 x 10 sparse Matrix of class "dgTMatrix" 
##    a b c d e f g h i j 
## r1 . . 5 . . 2 . . . . 
## r2 . . . . . . . . . 4 
## r4 . . . . . . . 3 3 . 
## r3 2 2 . 3 . . . 3 . . 
## r5 3 . . 1 . . . . . 5 

require(microbenchmark) 
microbenchmark(OPmatrixCombine1 = OP1(B), times = 30) 
## Unit: milliseconds 
##              expr      min       lq     mean   median       uq      max neval 
##  OPmatrixCombine1 578.9222 619.3912 665.6301 631.4219 646.2716 1013.777    30

Is there a better way, where better means faster and, if possible, not requiring the construction of additional large objects?

Source

2016-04-21 Ken Benoit

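Not really faster (at least with only a few unique colnames), and essentially the same as your approach, but building the indicator matrix directly as a sparse matrix, with a call ending in ... unique(colnames(B))), x = 1L), avoids creating the dense sapply(unique(colnames(B)), "==", colnames(B)) matrix that is then passed to Matrix to be made sparse.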
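
The expression in the comment above did not survive intact, so the following is only a plausible sketch of the idea it describes: build the indicator matrix directly with sparseMatrix() rather than through sapply(). The function name combine_by_indicator and the exact arguments are assumptions, not the commenter's verbatim code.

```r
library(Matrix)

# Hypothetical helper: same product as OP1(), but the indicator matrix
# is created sparse from the start (assumed reconstruction of the comment)
combine_by_indicator <- function(x) {
    nms <- colnames(x)
    uniquenms <- unique(nms)
    ind <- sparseMatrix(i = seq_along(nms),            # one row per original column
                        j = match(nms, uniquenms),     # column of its unique name
                        x = rep(1, length(nms)),
                        dims = c(length(nms), length(uniquenms)),
                        dimnames = list(NULL, uniquenms))
    x %*% ind
}

A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(c(1, 2, 3), 2),
                  dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2)))
res <- combine_by_indicator(A)
res
```

As the comment says, this still performs a matrix product, so it mainly saves the intermediate dense matrix rather than changing the overall approach.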

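Here is an attempt using the reindexing I had in mind, which I worked out with the help of a friend (Patrick, is that you?). It reindexes the j values and uses a very handy feature of sparseMatrix(): the x values of elements that share the same index position are added together.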

OP2 <- function(x) { 
    nms <- colnames(x) 
    uniquenms <- unique(nms) 
    # build the sparseMatrix again: x's with same index values are automatically 
    # added together, keeping in mind that indexes stored from 0 but built from 1 
    sparseMatrix(i = x@i + 1, 
       j = match(nms, uniquenms)[x@j + 1], 
       x = x@x, 
       dimnames = list(rownames(x), uniquenms), 
       giveCsparse = FALSE) 
}
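
The feature that OP2() relies on can be seen in isolation. This is a minimal sketch with made-up values: two triplets point at cell (1, 1), and the resulting matrix holds their sum.

```r
library(Matrix)

# Triplets (1, 1, 10) and (1, 1, 5) target the same cell; (2, 2, 7) stands alone
m <- sparseMatrix(i = c(1, 1, 2), j = c(1, 1, 2), x = c(10, 5, 7))
m[1, 1]  # 15: the two x values for cell (1, 1) are added
m[2, 2]  # 7
```

With triplet storage the duplicates may be kept as separate entries internally, but the matrix still represents their sum.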

OP2 gives the same results as OP1:

OP2(A) 
## 2 x 3 sparse Matrix of class "dgCMatrix" 
##    a b c 
## r1 1 2 3 
## r2 1 2 3 

all.equal(as(OP1(B), "dgTMatrix"), OP2(B)) 
## [1] TRUE

But it is faster:

require(microbenchmark) 
microbenchmark(OPmatrixCombine1 = OP1(B), 
       OPreindexSparse = OP2(B), 
       times = 30) 
## Unit: relative 
##              expr      min       lq     mean   median       uq      max neval 
##  OPmatrixCombine1 1.756769 1.307651 1.360487 1.341814 1.346864 1.460626    30 
##   OPreindexSparse 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    30
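
Since the question asks about operating on the index slots directly, one possible variant is to rewrite the j slot in place instead of rebuilding the matrix through sparseMatrix(). This is a sketch under assumptions and not part of the original answer: the function name reindex_cols is hypothetical, the input is assumed to be a general triplet matrix (dgTMatrix), and it has not been benchmarked here.

```r
library(Matrix)

# Hypothetical helper, not from the post: relabel columns by editing slots
reindex_cols <- function(x) {
    stopifnot(is(x, "dgTMatrix"))
    nms <- colnames(x)
    uniquenms <- unique(nms)
    # map each stored zero-based column index to the zero-based
    # index of that column's unique name
    x@j <- (match(nms, uniquenms) - 1L)[x@j + 1L]
    x@Dim[2L] <- length(uniquenms)
    x@Dimnames[[2L]] <- uniquenms
    validObject(x)
    x
}

A <- as(sparseMatrix(i = c(1, 2, 1, 2, 1, 2), j = 1:6, x = rep(c(1, 2, 3), 2),
                     dimnames = list(paste0("r", 1:2), rep(letters[1:3], 2))),
        "TsparseMatrix")
res <- reindex_cols(A)
# Triplets that now share a cell stay separate in storage;
# converting to column-compressed form collapses them into sums
as(res, "CsparseMatrix")
```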

Source

2016-04-22 23:06:46

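Summing over overlapping indices is indeed very handy; also, the approach based on unique(x) is more efficient than the length(unique(x)) * length(x) comparisons made by the sapply in your first approach.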
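
Assuming the comment above refers to match(x, unique(x)) as used in OP2(), the point can be made concrete: a single match() against the unique names yields the same column grouping that the sapply()-built indicator matrix encodes. A small sketch with illustrative variable names:

```r
nms <- rep(letters[1:3], 2)
uniquenms <- unique(nms)

# one lookup per column name
grp <- match(nms, uniquenms)

# indicator matrix as in OP1(): every name is compared with every unique name
ind <- sapply(uniquenms, "==", nms)
# recover the group of each column from the position of its single TRUE
grp_from_ind <- as.integer(apply(ind, 1, which))

identical(grp, grp_from_ind)
```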
