I assume you can load the entire dataset into a numpy array in RAM, and that you are working on Linux or a Mac. (If you are on Windows, or you can't fit the array into RAM, then you should copy the array to a file on disk and use numpy.memmap to access it. Your computer will cache the data from disk into RAM as well as it can, and those caches will be shared between processes, so it is not a terrible solution.)
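In case it helps, here is a rough sketch of that memmap fallback. The filename, dtype, and shape below are placeholders I've made up, not part of your setup:

import numpy as np

# hypothetical file and shape -- adjust for your dataset
filename = 'big_data.dat'
shape = (10000, 10000)

# create the file-backed array and fill it once
disk_array = np.memmap(filename, dtype=np.float64, mode='w+', shape=shape)
disk_array[:] = 0.0
disk_array.flush()

# each process can then open the same file; the OS page cache is shared,
# so the data is not duplicated in RAM for every process
shared_view = np.memmap(filename, dtype=np.float64, mode='r', shape=shape)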
Under the assumptions above, if the other processes you create via multiprocessing only need read-only access to the dataset, you can simply create the dataset and then launch the other processes. They will have access to the data from the original namespace. They can alter that data, but the changes will not be visible to other processes (the memory manager will copy each segment of memory they alter into their local memory map).
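As a minimal illustration of that copy-on-write behaviour (this assumes fork semantics on Linux/Mac; the names are just for demonstration):

import multiprocessing
import numpy as np

# the dataset is created before the child processes are started
big_data = np.zeros(5)

def reader():
    # the child sees the parent's data via fork
    print(big_data.sum())

def writer():
    # this only touches the child's copy-on-write pages;
    # the parent will not see the change
    big_data[:] = 99

if __name__ == '__main__':
    for target in (reader, writer):
        p = multiprocessing.Process(target=target)
        p.start()
        p.join()
    print(big_data)   # still all zeros in the parent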
If the other processes need to alter the original dataset and have those changes be visible to the parent process or to the other processes, you could use something like this:
import multiprocessing
import numpy as np

# create your big dataset
big_data = np.zeros((3, 3))

# create a shared-memory wrapper for big_data's underlying data
# (it doesn't matter what datatype we use, and 'c' is easiest)
# I think if lock=True, you get a serialized object, which you don't want.
# Note: you will need to set up your own method to synchronize access to big_data.
buf = multiprocessing.Array('c', big_data.data, lock=False)

# at this point, buf and big_data.data point to the same block of memory
# (try looking at id(buf[0]) and id(big_data.data[0])), but for some reason
# changes aren't propagated between them unless you do the following:
big_data.data = buf

# now you can update big_data from any process:
def add_one_direct():
    big_data[:] = big_data + 1

def add_one(a):
    # People say this won't work, since Process() will pickle the argument.
    # But in my experience Process() seems to pass the argument via shared
    # memory, so it works OK.
    a[:] = a + 1

print "starting value:"
print big_data

p = multiprocessing.Process(target=add_one_direct)
p.start()
p.join()

print "after add_one_direct():"
print big_data

p = multiprocessing.Process(target=add_one, args=(big_data,))
p.start()
p.join()

print "after add_one():"
print big_data
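Regarding the synchronization note in the comments above: one possible approach (just a sketch building on the big_data array from the example, not something from the original code) is to create a multiprocessing.Lock before starting the children and hold it around every read-modify-write:

lock = multiprocessing.Lock()

def add_one_locked():
    # hold the lock for the whole read-modify-write so concurrent
    # updates from other processes don't interleave
    with lock:
        big_data[:] = big_data + 1

workers = [multiprocessing.Process(target=add_one_locked) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

Because the lock, like big_data, is created before the child processes are started, each child inherits the same lock via fork.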