2017-09-06 180 views
2

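How to merge large sparse matrices

I have a large list of 25 sparse matrices (they are really big: one of them has 100M or more elements), and I need to merge them into one large sparse matrix.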

For example, one matrix A might look like this (this is a submatrix of my real 100M-element matrix):

> A 
5 x 4 sparse Matrix of class "dgCMatrix" 
       SKU 
CustomerID   404  457  547  558  
    100002_24655  1  .  .  .  
    100003_46919  .  1  1  .  
    100007_46702  .  .  .  .  
    100012_47709  .  .  .  .  
    100013_46132  1  1  1  1 

> dput(A) 
new("dgCMatrix" 
    , i = c(0L, 4L, 1L, 4L, 1L, 4L, 4L) 
    , p = c(0L, 2L, 4L, 6L, 7L) 
    , Dim = c(5L, 4L) 
    , Dimnames = structure(list(CustomerID = c("100002_24655", "100003_46919", 
"100007_46702", "100012_47709", "100013_46132"), SKU = c("404", 
"457", "547", "558")), .Names = c("CustomerID", "SKU" 
)) 
    , x = c(1, 1, 1, 1, 1, 1, 1) 
    , factors = list() 
) 

Another matrix, B, might look like this:

> B 
7 x 5 sparse Matrix of class "dgCMatrix" 
       SKU 
CustomerID   191  404  558  715  787   
    100002_24655  .  .  .  .  .    
    100007_46702  1  1  1  1  1    
    100012_47709  .  .  1  .  .    
    100013_46132  .  .  .  .  1    
    100014_46400  .  .  .  .  .    
    100014_605414  1  1  1  .  .    
    100014_631294  .  .  1  1  1    

> dput(B) 
new("dgCMatrix" 
    , i = c(1L, 5L, 1L, 5L, 1L, 2L, 5L, 6L, 1L, 6L, 1L, 3L, 6L) 
    , p = c(0L, 2L, 4L, 8L, 10L, 13L) 
    , Dim = c(7L, 5L) 
    , Dimnames = structure(list(CustomerID = c("100002_24655", "100007_46702", 
"100012_47709", "100013_46132", "100014_46400", "100014_605414", 
"100014_631294"), SKU = c("191", "404", "558", "715", 
"787")), .Names = c("CustomerID", "SKU")) 
    , x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) 
    , factors = list() 
) 

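The output should look like this (the first part is the first matrix and the second part is the second matrix; I separated them with a blank line for readability):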

12 x 7 sparse Matrix of class "dgCMatrix"  
      404 457 547 558 191 715 787  
    [1, ]  1 . . . . . .  
    [2, ]  . 1 1 . . . . 
    [3, ]  . . . . . . . 
    [4, ]  . . . . . . . 
    [5, ]  1 1 1 1 . . . 

    [6, ]  . . . . . . . 
    [7, ]  1 . . 1 1 1 1 
    [8, ]  . . . 1 . . . 
    [9, ]  . . . . . . 1 
    [10,]  . . . . . . . 
    [11,]  1 . . 1 1 . . 
    [12,]  . . . 1 . 1 1 

In other words, I want to merge by column name. So how can I merge all 25 sparse matrices?

+0

`> l <- list(A, B, C, ...)` `> do.call(rbind, l)` – Sagar

+2

@Sagar The matrices must have the same number of columns if you want to use rbind –

+0

@MartinaZapletalová - I didn't realize they had different numbers of columns... my bad. – Sagar

Answers

-1

Based on this answer, we can extend the approach to merge a list of matrices of arbitrary length, like this:

merge.sparse = function(M.list) { 
    A = M.list[[1]] 

    for (B in M.list[-1]) { # iterate over every matrix after the first
    # finding what's missing 
    misA = colnames(B)[!colnames(B) %in% colnames(A)] 
    misB = colnames(A)[!colnames(A) %in% colnames(B)] 

    misAl = as.vector(numeric(length(misA)), "list") 
    names(misAl) = misA 
    misBl = as.vector(numeric(length(misB)), "list") 
    names(misBl) = misB 

    ## adding missing columns to initial matrices 
    An = do.call(cbind, c(A, misAl)) 
    Bn = do.call(cbind, c(B, misBl))[,colnames(An)] 

    # final bind 
    A = rbind(An, Bn) 
    } 
    A 
} 

x = merge.sparse(list(A,B)) 
+0

It looks good, but: **Error: node stack overflow** **Error during wrapup: node stack overflow**. The error occurs at `An = do.call(cbind, c(A, misAl))` and `Bn = do.call(cbind, c(B, misBl))[, colnames(An)]` –

+0

I tried `An = Reduce(cbind, c(A, misAl))` and it works (I'm now trying it on a sparse matrix), but when I try the `Bn = Reduce` version I get **Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE): invalid character indexing** –

+0

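I've just added my edited code, which avoids this error. There is also a problem with `B in M.list[[2:length(M.list)]]`, but I don't know why. Maybe my edit is too complicated, so if you have any suggestions, please write them under [this answer](https://stackoverflow.com/a/46092893/8416107) –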

0

So I edited dww's answer slightly to avoid the errors I mentioned in the comments. It is somewhat slow, but then my matrices are very large.

> proc.time() - ptm 
    user system elapsed 
572.384 213.179 793.550 

This is the edited code:

merge.sparse = function(M.list) { 
    A = M.list[[1]] 

    for (i in seq_along(M.list)[-1]) { # i indexes the remaining matrices
        B = M.list[[i]]
        # finding what's missing
        misA = colnames(B)[!colnames(B) %in% colnames(A)]
        misB = colnames(A)[!colnames(A) %in% colnames(B)]

        misAl = as.vector(numeric(length(misA)), "list")
        names(misAl) = misA
        misBl = as.vector(numeric(length(misB)), "list")
        names(misBl) = misB

        ## adding missing columns to initial matrices
        ## (Reduce() does not pass the list names on, so restore them)
        An = Reduce(cbind, c(A, misAl))
        if (length(misA) > 0) { # nothing to rename when no column is missing
            lenA <- ncol(An) - length(misA) + 1
            colnames(An)[lenA:ncol(An)] = misA
        }

        Bn = Reduce(cbind, c(B, misBl))
        if (length(misB) > 0) {
            lenB <- ncol(Bn) - length(misB) + 1
            colnames(Bn)[lenB:ncol(Bn)] = misB
        }
        Bn <- Bn[, colnames(An)] # reorder B's columns to match A's

        # final bind (rbind() has no use.names argument, so it is dropped)
        A = rbind(An, Bn)
        print(c(length(M.list), i)) # progress: total matrices, current index
    }
    A 
}
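
The loop above calls `cbind` and `rbind` again for every matrix, which may account for part of the run time. A possible alternative, sketched here and not benchmarked, is to collect the non-zero entries of every matrix as (row, column, value) triplets, remap the column indices onto the union of all column names, and build the result once with `sparseMatrix()`. The function name `merge_sparse_triplets` is hypothetical and not from the thread, and the sketch assumes every matrix is a numeric `dgCMatrix` with both row and column names set.

```r
library(Matrix)

# Hypothetical helper (not from the thread): stack sparse matrices by rows,
# aligning columns by name, using one pass over the non-zero triplets.
# Assumes numeric matrices (dgCMatrix) that all carry row and column names.
merge_sparse_triplets <- function(M.list) {
    # union of all column names, in order of first appearance
    all.cols <- unique(unlist(lapply(M.list, colnames)))
    # number of rows to skip before each matrix in the stacked result
    n.rows <- vapply(M.list, nrow, integer(1))
    offsets <- cumsum(c(0L, n.rows[-length(n.rows)]))

    parts <- lapply(seq_along(M.list), function(k) {
        trip <- summary(M.list[[k]])  # 1-based i, j and value x per non-zero
        col.map <- match(colnames(M.list[[k]]), all.cols)
        list(i = trip$i + offsets[k], j = col.map[trip$j], x = trip$x)
    })

    sparseMatrix(i = unlist(lapply(parts, `[[`, "i")),
                 j = unlist(lapply(parts, `[[`, "j")),
                 x = unlist(lapply(parts, `[[`, "x")),
                 dims = c(sum(n.rows), length(all.cols)),
                 dimnames = list(unlist(lapply(M.list, rownames)), all.cols))
}

# The two example matrices from the question, rebuilt from their dput() slots
A <- sparseMatrix(i = c(0, 4, 1, 4, 1, 4, 4), p = c(0, 2, 4, 6, 7),
                  x = rep(1, 7), dims = c(5, 4), index1 = FALSE,
                  dimnames = list(c("100002_24655", "100003_46919",
                                    "100007_46702", "100012_47709",
                                    "100013_46132"),
                                  c("404", "457", "547", "558")))
B <- sparseMatrix(i = c(1, 5, 1, 5, 1, 2, 5, 6, 1, 6, 1, 3, 6),
                  p = c(0, 2, 4, 8, 10, 13),
                  x = rep(1, 13), dims = c(7, 5), index1 = FALSE,
                  dimnames = list(c("100002_24655", "100007_46702",
                                    "100012_47709", "100013_46132",
                                    "100014_46400", "100014_605414",
                                    "100014_631294"),
                                  c("191", "404", "558", "715", "787")))

merged <- merge_sparse_triplets(list(A, B))
```

Because the column order is taken from the first appearance of each name, `merged` has the columns in the order shown in the question. Unlike the loop, this sketch never reorders columns with `[, colnames(An)]`, so it sidesteps the character-indexing step where the error in the comments was reported.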