2015-12-28 40 views
1

我有一個包含計數的壓縮稀疏行矩陣。我想建立一個包含這些計數的預期頻率的矩陣。這是我目前使用的代碼:從稀疏矩陣計數中構建預期頻率矩陣的更快方法

from scipy.sparse import coo_matrix 

#m is a csr_matrix 

col_total = m.sum(axis=0) 
row_total = m.sum(axis=1) 
n = int(col_total.sum(axis=1)) 
A = coo_matrix(m) 

for i,j in zip(A.row,A.col): 
    m[i,j]= col_total.item(j)*row_total.item(i)/n 

這對小矩陣很好。在更大的矩陣(> 1Gb)上,for循環需要幾天才能運行。有什麼辦法可以讓這個更快嗎?

+0

是'row_total.item(我)'應該是在分母? – user2357112

+0

我沒有看到這個計算應該如何產生預期的頻率。 – user2357112

+0

你說得對,應該是col_total.item(j)* row_total.item(i)/ n 我現在要編輯。 – kormak

回答

1

要擴大@hpaulj的回答一點,你可以直接從期望頻率和該行創建輸出矩陣擺脫for循環非零元素的m /列索引:

from scipy import sparse 
import numpy as np 

def fast_efreqs(m): 

    col_total = np.array(m.sum(axis=0)).ravel() 
    row_total = np.array(m.sum(axis=1)).ravel() 

    # I'm casting this to an int for consistency with your version, but it's 
    # not clear to me why you would want to do this... 
    grand_total = int(col_total.sum()) 

    ridx, cidx = m.nonzero()   # indices of non-zero elements in m 
    efreqs = row_total[ridx] * col_total[cidx]/grand_total 

    return sparse.coo_matrix((efreqs, (ridx, cidx))) 

爲了便於比較,這裏是你的原代碼的函數:

01在一個小矩陣
def orig_efreqs(m): 

    col_total = m.sum(axis=0) 
    row_total = m.sum(axis=1) 
    n = int(col_total.sum(axis=1)) 

    A = sparse.coo_matrix(m) 
    for i,j in zip(A.row,A.col): 
     m[i,j]= col_total.item(j)*row_total.item(i)/n 

    return m 

相等測試:在一個更大的矩陣

m = sparse.rand(100, 100, density=0.1, format='csr') 
print((orig_efreqs(m.copy()) != fast_efreqs(m)).nnz == 0) 
# True 

基準性能:

In [1]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr') 
    .....: orig_efreqs(m) 
    .....: 
1 loops, best of 3: 2min 25s per loop 

In [2]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr') 
    .....: fast_efreqs(m) 
    .....: 
10 loops, best of 3: 38.3 ms per loop 
2

m.data = (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0]是一種完全矢量化的計算m.data的方法。它可能會被清理一下。 col_totalmatrix,因此執行逐元素乘法需要一些額外的語法。

我將演示:

In [37]: m=sparse.rand(10,10,.1,'csr') 
In [38]: col_total=m.sum(axis=0) 
In [39]: row_total=m.sum(axis=1) 
In [40]: n=int(col_total.sum(axis=1)) 

In [42]: A=m.tocoo() 

In [46]: for i,j in zip(A.row,A.col): 
    ....:   m[i,j]= col_total.item(j)*row_total.item(i)/n 
    ....:  

In [49]: m.data 
Out[49]: 
array([ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 , 
     0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632]) 

In [51]: col_total[:,A.col].A*(row_total[A.row,:].T.A)/n 
Out[51]: 
array([[ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 , 
     0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632]]) 

In [53]: (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0] 
Out[53]: 
array([ 0.39490171, 0.64246488, 0.19310878, 0.13847277, 0.2018023 , 
     0.008504 , 0.04387622, 0.10903026, 0.37976005, 0.11414632])