L2 normalization

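Since I want to use only numpy and scipy (I don't want to use scikit-learn), I would like to know how to perform row-wise L2 normalization on a huge SciPy csc_matrix (2,000,000 x 500,000). The operation must consume as little memory as possible, because the matrix has to fit in memory.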

What I have so far:

import numpy as np
import scipy.sparse as sp

tf_idf_matrix = sp.lil_matrix((n_docs, n_terms), dtype=np.float16) 
# ... perform several operations and fill up the matrix 

tf_idf_matrix = tf_idf_matrix/l2_norm(tf_idf_matrix) 
# l2_norm() is what I want 

def l2_norm(sparse_matrix): 
    pass


2014-03-01 Trein

Just to add: if anyone else is not averse to scikit-learn, it is 'from sklearn.preprocessing import normalize; normalize(tf_idf_matrix)'. Shameless plug from a sklearn developer. –

Since I couldn't find the answer anywhere, I'll post here how I handled the problem.

def l2_norm(sparse_csc_matrix):
    # first, convert the csc_matrix to a csr_matrix copy, which takes linear time
    norm = sparse_csc_matrix.tocsr(copy=True)

    # compute the inverse of the L2 norm of each non-empty row
    norm.data **= 2
    norm = np.asarray(norm.sum(axis=1)).ravel()
    n_nzeros = norm > 0
    norm[n_nzeros] = 1.0 / np.sqrt(norm[n_nzeros])

    # modify sparse_csc_matrix in place: in CSC format, indices[k] is the
    # row of data[k], so each stored value is scaled by its row's factor
    sparse_csc_matrix.data *= norm[sparse_csc_matrix.indices]

If anyone has a better approach, please post it.
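One alternative that stays within numpy and scipy is to left-multiply by a diagonal matrix of inverse row norms. This is only a sketch, not something from the original answer: the name l2_normalize_rows is hypothetical, and the function returns a new matrix instead of working in place, so it is likely to need more peak memory than an in-place update.

```python
import numpy as np
import scipy.sparse as sp


def l2_normalize_rows(matrix):
    """Return a new CSC matrix whose non-empty rows have unit L2 norm.

    Hypothetical helper for illustration; it allocates a second matrix.
    """
    # Row-wise sum of squares; multiply() is element-wise and stays sparse.
    sq_sums = np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel()

    # Reciprocal norms; all-zero rows keep a factor of 0 and stay all-zero.
    inv_norms = np.zeros_like(sq_sums, dtype=np.float64)
    nonzero = sq_sums > 0
    inv_norms[nonzero] = 1.0 / np.sqrt(sq_sums[nonzero])

    # Left-multiplying by diag(inv_norms) scales row i by inv_norms[i].
    return (sp.diags(inv_norms) @ matrix).tocsc()
```

The product goes through a full sparse matrix multiplication, so on a matrix of the size in the question it is worth timing against the in-place variant before settling on it.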


2014-03-01 23:55:18 Trein

This segfaults on my box (SciPy 0.12.0, randomly generated CSC matrix). –

I am using SciPy 0.11.0. Can you provide the code to generate the random CSC matrix? – Trein

'X = np.random.randn(10, 4); sparse_csc_matrix = csc_matrix(X)'. Newer SciPy also has a 'scipy.sparse.rand' utility. –
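For anyone who wants to check a row-scaling routine on a small random CSC matrix like the one above, here is a minimal sketch. It assumes the scaling is done through the public data and indices arrays of the CSC matrix instead of the sparsetools routines; scale_rows_inplace is a hypothetical helper name, and the seed and the zeroed row are arbitrary choices for the test.

```python
import numpy as np
import scipy.sparse as sp


def scale_rows_inplace(csc, factors):
    # In CSC format, indices[k] is the row index of the stored value data[k].
    csc.data *= factors[csc.indices]


rng = np.random.RandomState(0)
X = rng.randn(10, 4)
X[3, :] = 0.0  # include an all-zero row to exercise the zero-norm case
sparse_csc_matrix = sp.csc_matrix(X)

# Inverse L2 norm per row, with 0 for empty rows.
sq_sums = np.asarray(
    sparse_csc_matrix.multiply(sparse_csc_matrix).sum(axis=1)
).ravel()
factors = np.zeros_like(sq_sums)
nonzero = sq_sums > 0
factors[nonzero] = 1.0 / np.sqrt(sq_sums[nonzero])

scale_rows_inplace(sparse_csc_matrix, factors)

# Non-empty rows should now have norm 1, the empty row norm 0.
row_norms = np.linalg.norm(sparse_csc_matrix.toarray(), axis=1)
expected = nonzero.astype(float)
assert np.allclose(row_norms, expected)
```

The dense conversion at the end is only there for the check and would not be practical on the full-size matrix.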
