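Is it safe to share a read-only scipy sparse matrix between multiple processes?

I have a computation that is somewhat expensive, and I want to spawn multiple processes to complete it. The gist is more or less:

1) I have a large scipy.sparse.csc_matrix (another sparse format could be used if needed) from which I will read data for the computation (read only, never written).
2) I have to do a lot of embarrassingly parallel computations and return the values.

So I did something like this: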

    import numpy as np
    from multiprocessing import Process, Manager

    def f(instance, big_matrix):
        """
        This is the actual thing I want to calculate. This reads lots of
        data from big_matrix but never writes anything to it.
        """
        return stuff_calculated  # placeholder for the real result

    def do_some_work(big_matrix, instances, outputs):
        """
        This does a few chunked calculations for a few instances and
        saves the results in `outputs`, a dictionary shared between processes.
        """
        for instance in instances:
            x = f(instance, big_matrix)
            outputs[instance] = x

    def split_work(big_matrix, instances_to_calculate):
        """
        Split do_some_work into many processes by chunking instances_to_calculate,
        creating a shared dictionary and spawning and joining the processes.
        """
        # break the instance list into 4 chunks, one per process
        instance_sets = np.array_split(instances_to_calculate, 4)
        manager = Manager()
        outputs = manager.dict()
        processes = [
            # each process gets its own chunk, not the whole list of chunks
            Process(target=do_some_work, args=(big_matrix, instances, outputs))
            for instances in instance_sets
        ]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        return instance_sets, outputs

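My question is: is this safe? My function f never writes anything, but I haven't taken any precautions to share big_matrix between the processes, I just pass it. It seems to be working, but I am concerned that I could corrupt something by passing the value between multiple processes, even if I never write to it.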
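One way to reason about this worry: under the "spawn" start method, the arguments of a Process are pickled and rebuilt in the child, and under "fork" the child sees a copy-on-write snapshot of the parent's memory. In both cases the worker operates on its own copy. The sketch below is only an approximation of the "spawn" case; it uses a pickle round trip in a single process instead of starting real workers, and the small random matrix is an assumption made for illustration.

```python
import pickle

import numpy as np
from scipy import sparse

# Small stand-in for the real big_matrix (assumption: any CSC matrix will do).
big_matrix = sparse.random(100, 80, density=0.05, format="csc", random_state=0)
snapshot = big_matrix.copy()

# Roughly what a spawned worker receives: a pickled and rebuilt copy.
worker_copy = pickle.loads(pickle.dumps(big_matrix))

# The rebuilt matrix has its own buffers, separate from the parent's.
independent = not np.shares_memory(worker_copy.data, big_matrix.data)

# Even a write in the "worker" copy leaves the parent's matrix unchanged.
worker_copy.data[:] = 0.0
parent_untouched = (big_matrix != snapshot).nnz == 0

print(independent, parent_untouched)
```

The flip side is that, under "spawn", every worker pays the memory and pickling cost of a full copy of the matrix.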
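I tried to use the sharemem package to share the matrix between multiple processes, but it seems unable to hold scipy sparse matrices, only normal numpy arrays.

If this isn't safe, how can I share a large (read-only) sparse matrix between processes without problems?

I saw here that I can make another csc_matrix pointing to the same memory: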

    from scipy.sparse import csc_matrix

    other_matrix = csc_matrix(
        (big_matrix.data, big_matrix.indices, big_matrix.indptr),
        shape=big_matrix.shape,
        copy=False
    )

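Would this make it any safer, or would it be just as safe as passing the original object?

Thanks.

It is evidently possible to share the three defining elements of a CSC matrix. But is it possible to rebuild a proper csc_matrix object in each thread without copying the shared data? – gerowam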
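For what it's worth, a quick check of what the copy=False construction does inside a single process might look like the sketch below. It assumes a small random matrix in place of the real one, and it only shows that the second object is a wrapper over the same three buffers. It says nothing by itself about what other processes see.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csc_matrix

# Small stand-in for the real big_matrix (assumption for illustration).
big_matrix = sparse.random(50, 40, density=0.1, format="csc", random_state=0)

other_matrix = csc_matrix(
    (big_matrix.data, big_matrix.indices, big_matrix.indptr),
    shape=big_matrix.shape,
    copy=False,
)

# Compare the underlying buffers of the two matrix objects.
shared = {
    name: np.shares_memory(getattr(other_matrix, name), getattr(big_matrix, name))
    for name in ("data", "indices", "indptr")
}
print(shared)
```

Note that scipy may still copy the index arrays if it has to convert their dtype, so the no-copy behaviour is best verified on the actual matrix.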
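A minimal sketch of the idea in this comment, assuming Python 3.8+ and the standard library's multiprocessing.shared_memory: the parent copies data, indices and indptr into shared blocks once, and each worker attaches by name and wraps the buffers in a csc_matrix with copy=False. The helper names to_shared, rebuild_csc and demo are hypothetical, and the "worker" here runs in the same process to keep the example short; in real use rebuild_csc would be called inside each worker with the specs passed as arguments.

```python
import numpy as np
from multiprocessing import shared_memory
from scipy import sparse
from scipy.sparse import csc_matrix

def to_shared(arr):
    """Parent side: copy one array into a new shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr
    # The spec is small and picklable, so it is cheap to send to workers.
    return shm, (shm.name, arr.shape, arr.dtype.str)

def rebuild_csc(specs, shape):
    """Worker side: attach to the three blocks and wrap them without copying."""
    handles, arrays = [], []
    for name, arr_shape, dtype in specs:
        shm = shared_memory.SharedMemory(name=name)
        handles.append(shm)  # keep the handles alive as long as the matrix
        arrays.append(np.ndarray(arr_shape, dtype=dtype, buffer=shm.buf))
    data, indices, indptr = arrays
    return handles, csc_matrix((data, indices, indptr), shape=shape, copy=False)

def demo():
    big = sparse.random(200, 100, density=0.05, format="csc", random_state=0)
    owners, specs = zip(*(to_shared(a) for a in (big.data, big.indices, big.indptr)))

    handles, rebuilt = rebuild_csc(specs, big.shape)
    same_values = (rebuilt != big).nnz == 0
    # Each array of the rebuilt matrix should live inside its attached block.
    no_copy = all(
        np.shares_memory(part, np.frombuffer(h.buf, dtype=np.uint8))
        for part, h in zip((rebuilt.data, rebuilt.indices, rebuilt.indptr), handles)
    )

    # Drop every array that points into the blocks before closing them.
    del rebuilt
    for h in handles:
        h.close()
    for o in owners:
        o.close()
        o.unlink()
    return same_values, no_copy

if __name__ == "__main__":
    print(demo())
```

The index arrays are shared with the dtype scipy already chose for them; if a different dtype were stored, the constructor might convert the arrays and silently copy them.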