從稀疏矩陣計數中構建預期頻率矩陣的更快方法

我有一個包含計數的壓縮稀疏行矩陣。我想建立一個包含這些計數的預期頻率的矩陣。這是我目前使用的代碼：從稀疏矩陣計數中構建預期頻率矩陣的更快方法

from scipy.sparse import coo_matrix 

#m is a csr_matrix 

col_total = m.sum(axis=0) 
row_total = m.sum(axis=1) 
n = int(col_total.sum(axis=1)) 
A = coo_matrix(m) 

for i,j in zip(A.row,A.col): 
    m[i,j]= col_total.item(j)*row_total.item(i)/n

這對小矩陣很好。在更大的矩陣（> 1Gb）上，for循環需要幾天才能運行。有什麼辦法可以讓這個更快嗎？

來源

2015-12-28 kormak

是'row_total.item（我）'應該是在分母？ – user2357112

我沒有看到這個計算應該如何產生預期的頻率。 – user2357112

你說得對，應該是col_total.item（j）* row_total.item（i）/ n 我現在要編輯。 – kormak

要擴大@hpaulj的回答一點，你可以直接從期望頻率和該行創建輸出矩陣擺脫for循環非零元素的m /列索引：

from scipy import sparse 
import numpy as np 

def fast_efreqs(m): 

    col_total = np.array(m.sum(axis=0)).ravel() 
    row_total = np.array(m.sum(axis=1)).ravel() 

    # I'm casting this to an int for consistency with your version, but it's 
    # not clear to me why you would want to do this... 
    grand_total = int(col_total.sum()) 

    ridx, cidx = m.nonzero()   # indices of non-zero elements in m 
    efreqs = row_total[ridx] * col_total[cidx]/grand_total 

    return sparse.coo_matrix((efreqs, (ridx, cidx)))

爲了便於比較，這裏是你的原代碼的函數：

01在一個小矩陣

def orig_efreqs(m): 

    col_total = m.sum(axis=0) 
    row_total = m.sum(axis=1) 
    n = int(col_total.sum(axis=1)) 

    A = sparse.coo_matrix(m) 
    for i,j in zip(A.row,A.col): 
     m[i,j]= col_total.item(j)*row_total.item(i)/n 

    return m

相等測試：在一個更大的矩陣

m = sparse.rand(100, 100, density=0.1, format='csr') 
print((orig_efreqs(m.copy()) != fast_efreqs(m)).nnz == 0) 
# True

基準性能：

In [1]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr') 
    .....: orig_efreqs(m) 
    .....: 
1 loops, best of 3: 2min 25s per loop 

In [2]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr') 
    .....: fast_efreqs(m) 
    .....: 
10 loops, best of 3: 38.3 ms per loop

來源

2015-12-28 18:48:21

m.data = (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0]是一種完全矢量化的計算m.data的方法。它可能會被清理一下。 col_total是matrix，因此執行逐元素乘法需要一些額外的語法。

我將演示：

In [37]: m=sparse.rand(10,10,.1,'csr') 
In [38]: col_total=m.sum(axis=0) 
In [39]: row_total=m.sum(axis=1) 
In [40]: n=int(col_total.sum(axis=1)) 

In [42]: A=m.tocoo() 

In [46]: for i,j in zip(A.row,A.col): 
    ....:   m[i,j]= col_total.item(j)*row_total.item(i)/n 
    ....:  

In [49]: m.data 
Out[49]: 
array([ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 , 
     0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632]) 

In [51]: col_total[:,A.col].A*(row_total[A.row,:].T.A)/n 
Out[51]: 
array([[ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 , 
     0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632]]) 

In [53]: (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0] 
Out[53]: 
array([ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 , 
     0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632])

來源

2015-12-28 18:15:56 hpaulj

從稀疏矩陣計數中構建預期頻率矩陣的更快方法

回答

相關問題