Efficiently set row in SciPy sparse.lil_matrix?

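I have sparse vectors of dimension about 200,000. I also have a matrix with the same number of columns and as many rows as there are vectors. I want to set all of the vectors into the matrix incrementally, that is, the first vector should become the first row, and so on.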

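Currently, both the matrix and the vectors are of type scipy.sparse.lil_matrix. Each vector is set as a specific row of the matrix using the following function: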

In [7]: us.get_utterance_representation('here is a sentence') 
Out[7]: 
<1x188796 sparse matrix of type '<type 'numpy.float64'>' 
    with 22489 stored elements in Compressed Sparse Row format> 

def set_row_vector(self, row, rowvector): 
    self.matrix[row] = rowvector[0] 

for row, utterance in enumerate(utterances): 
    uvector = self.get_utterance_representation(utterance) 
    self.utterancematrix.set_row_vector(row, uvector)

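where uvector is a lil_matrix of dimension 1 x ~200,000.

Creating the matrix this way turns out to be very inefficient: a single text string (utterance) takes up to 5 seconds. Looking at the profile, I concluded that setting the vector as a row of the matrix is the main problem.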

 55  def set_row_vector(self, row, rowvector): 
         2564609 function calls (2564606 primitive calls) in 5.046 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22489    1.397    0.000    1.397    0.000 {numpy.core.multiarray.where}
    22489    0.783    0.000    2.188    0.000 csr.py:281(_get_single_element)
    44978    0.365    0.000    0.916    0.000 stride_tricks.py:35(broadcast_arrays)
    44978    0.258    0.000    0.413    0.000 stride_tricks.py:22(as_strided)
   202490    0.244    0.000    0.244    0.000 {numpy.core.multiarray.array}
    22489    0.199    0.000    2.221    0.000 lil.py:280(__setitem__)
    44978    0.174    0.000    0.399    0.000 sputils.py:171(_unpack_index)
   584777    0.171    0.000    0.171    0.000 {isinstance}
    44988    0.170    0.000    0.230    0.000 sputils.py:115(isintlike)
    67467    0.166    0.000    0.278    0.000 sputils.py:196(_check_boolean)
    22489    0.154    0.000    0.647    0.000 sputils.py:215(_index_to_arrays)
        1    0.129    0.129    5.035    5.035 dsm_classes.py:55(set_row_vector)
    22489    0.120    0.000    0.171    0.000 lil.py:247(_insertat2)
    67467    0.102    0.000    0.102    0.000 {method 'ravel' of 'numpy.ndarray' objects}

My question is: is there a better way to build the matrix from the utterances?

Thanks!

Source

2014-01-30 Jimmy C

First off, I think your uvector is actually in CSR format, not LIL. That is probably for the best, though:

In [30]: import scipy.sparse as ss

In [31]: row = ss.rand(1,5000,0.1,'csr')

In [32]: matrix = ss.lil_matrix((30,5000))

In [33]: %timeit matrix[0] = row
10 loops, best of 3: 65.6 ms per loop

In [34]: row_lil = row.tolil()

In [35]: %timeit matrix[0] = row_lil
10 loops, best of 3: 93.4 ms per loop

Next, you can avoid some cost by removing the [0] subscript from your rowvector:

In [38]: %timeit matrix[0] = row[0]
10 loops, best of 3: 104 ms per loop

In [39]: %timeit matrix[0] = row
10 loops, best of 3: 68.7 ms per loop

Finally, the real meat of the solution is to avoid the LIL format as much as possible. While it is the most flexible format, it is also the slowest (in general). For instance, if you just want to build up your matrix one row at a time, you can use scipy.sparse.vstack:

In [40]: %%timeit
   ....: for i in xrange(matrix.shape[0]):
   ....:     matrix[i] = row
   ....:
1 loops, best of 3: 3.14 s per loop 

In [41]: %timeit ss.vstack([row for i in xrange(matrix.shape[0])]) 
1000 loops, best of 3: 1.46 ms per loop 

In [44]: m2 = ss.vstack([row for i in xrange(matrix.shape[0])]) 

In [45]: numpy.allclose(matrix.todense(), m2.todense()) 
Out[45]: True
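Applied to a loop like the one in the question, the same idea might look like the sketch below. It is written for Python 3 and a recent SciPy, and it is only an illustration: build_matrix is a hypothetical helper, and the random rows stand in for the output of get_utterance_representation, whose real implementation is not shown in the question.

```python
import numpy as np
import scipy.sparse as ss


def build_matrix(row_vectors):
    """Stack an iterable of 1 x n sparse rows into a single CSR matrix."""
    # Collect the rows first and stack once, instead of assigning
    # each row into a LIL matrix.
    rows = [ss.csr_matrix(v) for v in row_vectors]
    return ss.vstack(rows, format="csr")


# Assumption: random 1 x 5000 CSR rows as a stand-in for the real
# utterance vectors (hypothetical data, one seed per row).
vectors = [ss.random(1, 5000, density=0.1, format="csr", random_state=i)
           for i in range(30)]

utterance_matrix = build_matrix(vectors)
print(utterance_matrix.shape)
```

If rows have to be modified after construction, the result can still be converted once at the end with utterance_matrix.tolil().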

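Edit: If memory is a concern but you still want maximum speed, you can write your own vstack based on the fast vstack for CSR matrices. I would start by copying the _compressed_sparse_stack function and calling it with your list of CSR rows and axis=0. Then you should be able to modify it to take an iterator instead of a list, which would avoid the high memory overhead. Alternatively, you could embed the steps into your for loop. Either way you will lose a little speed, but you may save a lot of memory.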
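A rough sketch of the iterator variant follows. It does not copy SciPy's private _compressed_sparse_stack; instead it uses the public csr_matrix((data, indices, indptr)) constructor to the same end. csr_from_rows is a hypothetical name, the random rows are made-up test data, and whether this is enough to solve a given memory problem has not been measured.

```python
import numpy as np
import scipy.sparse as ss


def csr_from_rows(rows, n_cols):
    """Build a CSR matrix from an iterator of 1 x n_cols sparse rows."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        row = ss.csr_matrix(row)          # ensure CSR layout
        n = row.nnz
        data.append(row.data[:n])         # stored values of this row
        indices.append(row.indices[:n])   # their column indices
        indptr.append(indptr[-1] + n)     # running offset into data
    if not data:
        return ss.csr_matrix((0, n_cols))
    return ss.csr_matrix(
        (np.concatenate(data), np.concatenate(indices), np.array(indptr)),
        shape=(len(indptr) - 1, n_cols),
    )


# Assumption: random rows as hypothetical stand-ins for utterance vectors.
rows = [ss.random(1, 5000, density=0.1, format="csr", random_state=i)
        for i in range(30)]
m = csr_from_rows(iter(rows), 5000)
print(m.shape)
```

Only the small per-row data and indices arrays are kept until the final concatenation, so a generator can be passed in place of the list used here.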

Source

2014-01-30 18:09:43 perimosocordiae

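Thanks for your help! This did indeed speed things up significantly, but unfortunately I now run into memory problems. With vstack, not even 40 GB of memory is enough to fit the whole process. Would you have any further ideas? –

See my edit above. I haven't tried my suggestion, but I think it should work. – perimosocordiae

Thanks, I'll give it a try! –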
