I want to iterate over the rows of a CSR matrix and divide each element by the row's sum, similar to this post: numpy scipy csr matrix, row wise operation
My problem is that I'm dealing with a large matrix: (96582, 350138)
When I apply the operation from the linked post, it blows up my memory, because the returned matrix is dense.
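For reference, the operation from the linked post is roughly the sketch below (the tiny counts matrix here is only an illustration, not my data). counts.sum(axis=1) is a dense column, and dividing a sparse matrix by a dense array returns a dense result, which at (96582, 350138) in float64 would need roughly 270 GB:

import numpy as np
from scipy import sparse

# tiny stand-in for the real (96582, 350138) counts matrix
counts = sparse.csr_matrix(np.array([[1., 2., 0.],
                                     [0., 3., 3.]]))

# the row normalization from the linked post (roughly): dividing by
# the dense column of row sums produces a dense result
normalized = counts / counts.sum(axis=1)
print(type(normalized))  # dense, no longer sparse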
So this was my first attempt:
for row in counts:
    row = row / row.sum()
Unfortunately this doesn't affect the matrix at all (iterating over a csr matrix yields new single-row matrices, and reassigning the loop variable never writes back into counts), so I came up with a second idea: build a new CSR matrix and concatenate the rows using vstack:
from scipy import sparse
import time

start_time = curr_time = time.time()
# start with an empty csr matrix and append one normalized row per iteration
mtx = sparse.csr_matrix((0, counts.shape[1]))
for i, row in enumerate(counts):
    prob_row = row / row.sum()
    mtx = sparse.vstack([mtx, prob_row])
    if i % 1000 == 0:
        delta_time = time.time() - curr_time
        total_time = time.time() - start_time
        curr_time = time.time()
        print('step: %i, total time: %i, delta_time: %i' % (i, total_time, delta_time))
This works, but after a number of iterations it gets slower and slower:
step: 0, total time: 0, delta_time: 0
step: 1000, total time: 1, delta_time: 1
step: 2000, total time: 5, delta_time: 4
step: 3000, total time: 12, delta_time: 6
step: 4000, total time: 23, delta_time: 11
step: 5000, total time: 38, delta_time: 14
step: 6000, total time: 55, delta_time: 17
step: 7000, total time: 88, delta_time: 32
step: 8000, total time: 136, delta_time: 47
step: 9000, total time: 190, delta_time: 53
step: 10000, total time: 250, delta_time: 59
step: 11000, total time: 315, delta_time: 65
step: 12000, total time: 386, delta_time: 70
step: 13000, total time: 462, delta_time: 76
step: 14000, total time: 543, delta_time: 81
step: 15000, total time: 630, delta_time: 86
step: 16000, total time: 722, delta_time: 92
step: 17000, total time: 820, delta_time: 97
Any suggestions? Any idea why vstack gets slower and slower?
See https://stackoverflow.com/a/45339754 and https://stackoverflow.com/q/44080315 – hpaulj
As with dense arrays, repeated concatenation in a loop is slow. It is faster to accumulate the results in a list and do a single 'vstack'. – hpaulj
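A minimal sketch of that suggestion (assuming, as in the loops above, that every row of counts has a nonzero sum):

from scipy import sparse

rows = []
for row in counts:                # each row is a 1 x n csr matrix
    rows.append(row / row.sum())  # sparse / scalar stays sparse
mtx = sparse.vstack(rows)         # a single concatenation at the end

Each vstack inside the original loop copies everything accumulated so far, so the loop is quadratic in the number of rows, which matches the growing delta_time above. Avoiding the Python-level row loop entirely should be faster still; one common sparsity-preserving trick (again a sketch, not taken from the linked answers) is to left-multiply by a diagonal matrix of inverse row sums:

import numpy as np
from scipy import sparse

row_sums = np.asarray(counts.sum(axis=1)).ravel().astype(float)
inv_sums = np.divide(1.0, row_sums, out=np.zeros_like(row_sums),
                     where=row_sums != 0)  # guard against empty rows
mtx = sparse.diags(inv_sums) @ counts      # result stays sparse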