I assume you can load the entire dataset into a numpy array in RAM, and that you are working on Linux or a Mac. (If you are on Windows, or you can't fit the array into RAM, then you should copy the array to a file on disk and use numpy.memmap to access it. Your computer will cache the data from disk into RAM as well as it can, and those caches will be shared between processes, so it is not a terrible solution.)
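In case it helps, here is a rough sketch of that memmap fallback. The filename, dtype, and shape below are placeholders I've made up, not part of your setup:

import numpy as np

# hypothetical file and shape -- adjust for your dataset
filename = 'big_data.dat'
shape = (10000, 10000)

# create the file-backed array and fill it once
disk_array = np.memmap(filename, dtype=np.float64, mode='w+', shape=shape)
disk_array[:] = 0.0
disk_array.flush()

# each process can then open the same file; the OS page cache is shared,
# so the data is not duplicated in RAM for every process
shared_view = np.memmap(filename, dtype=np.float64, mode='r', shape=shape)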
Under the assumptions above, if the other processes you create via multiprocessing only need read-only access to the dataset, you can simply create the dataset and then launch the other processes. They will have access to the data from the original namespace. They can alter that data, but the changes will not be visible to other processes (the memory manager will copy each segment of memory they alter into their local memory map).
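As a minimal illustration of that copy-on-write behaviour (this assumes fork semantics on Linux/Mac; the names are just for demonstration):

import multiprocessing
import numpy as np

# the dataset is created before the child processes are started
big_data = np.zeros(5)

def reader():
    # the child sees the parent's data via fork
    print(big_data.sum())

def writer():
    # this only touches the child's copy-on-write pages;
    # the parent will not see the change
    big_data[:] = 99

if __name__ == '__main__':
    for target in (reader, writer):
        p = multiprocessing.Process(target=target)
        p.start()
        p.join()
    print(big_data)   # still all zeros in the parent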
If the other processes need to alter the original dataset and have those changes be visible to the parent process or to the other processes, you could use something like this:
import multiprocessing
import numpy as np

# create your big dataset
big_data = np.zeros((3, 3))

# create a shared-memory wrapper for big_data's underlying data
# (it doesn't matter what datatype we use, and 'c' is easiest)
# I think if lock=True, you get a serialized object, which you don't want.
# Note: you will need to set up your own method to synchronize access to big_data.
buf = multiprocessing.Array('c', big_data.data, lock=False)

# at this point, buf and big_data.data point to the same block of memory
# (try looking at id(buf[0]) and id(big_data.data[0])), but for some reason
# changes aren't propagated between them unless you do the following:
big_data.data = buf

# now you can update big_data from any process:
def add_one_direct():
    big_data[:] = big_data + 1

def add_one(a):
    # People say this won't work, since Process() will pickle the argument.
    # But in my experience Process() seems to pass the argument via shared
    # memory, so it works OK.
    a[:] = a + 1

print "starting value:"
print big_data

p = multiprocessing.Process(target=add_one_direct)
p.start()
p.join()

print "after add_one_direct():"
print big_data

p = multiprocessing.Process(target=add_one, args=(big_data,))
p.start()
p.join()

print "after add_one():"
print big_data
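Regarding the synchronization note in the comments above: one possible approach (just a sketch building on the big_data array from the example, not something from the original code) is to create a multiprocessing.Lock before starting the children and hold it around every read-modify-write:

lock = multiprocessing.Lock()

def add_one_locked():
    # hold the lock for the whole read-modify-write so concurrent
    # updates from other processes don't interleave
    with lock:
        big_data[:] = big_data + 1

workers = [multiprocessing.Process(target=add_one_locked) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

Because the lock, like big_data, is created before the child processes are started, each child inherits the same lock via fork.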