2016-08-10 20 views

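Python multiprocessing: pool workers accessing a global list variable, using a lock or a manager list

I am trying to distribute jobs across multiple CUDA devices so that the total number of jobs running at any time is less than or equal to the number of available CPU cores. To do this, I determine the number of available "slots" on each device and create a list of the available slots. If I have 6 CPU cores and two CUDA devices (0 and 1), then AVAILABLE_GPU_SLOTS = [0, 1, 0, 1, 0, 1]. In my worker function I pop a slot from the list and save it to a variable, set the CUDA_VISIBLE_DEVICES env var in the subprocess call, and then append the slot back to the list. This has worked so far, but I want to avoid race conditions.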
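For illustration, the slot list described above could be built roughly as follows. The question does not show `build_available_gpu_slots`, so this body is a hypothetical sketch that merely reproduces the `[0, 1, 0, 1, 0, 1]` example by cycling through the device ids.

```python
from itertools import cycle, islice


def build_available_gpu_slots(pool_size, devices):
    """Hypothetical helper: return pool_size device ids, assigned round-robin."""
    if not devices:
        raise ValueError("at least one CUDA device id is required")
    # cycle() repeats the device ids forever; islice() cuts the stream at pool_size.
    return list(islice(cycle(devices), pool_size))


print(build_available_gpu_slots(6, [0, 1]))  # [0, 1, 0, 1, 0, 1]
```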

The current code is as follows:

import multiprocessing
import os
import subprocess

# YANK_FILES, build_cmd, get_cuda_devices and build_available_gpu_slots
# are defined elsewhere (not shown).

def work(cmd):
    slot = AVAILABLE_GPU_SLOTS.pop()
    exit_code = subprocess.call(cmd, shell=False, env=dict(os.environ, CUDA_VISIBLE_DEVICES=str(slot)))
    AVAILABLE_GPU_SLOTS.append(slot)
    return exit_code

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    mols_to_be_run = [name for name in os.listdir(YANK_FILES) if os.path.isdir(os.path.join(YANK_FILES, name))]
    cmds = build_cmd(mols_to_be_run)
    cuda = get_cuda_devices()
    AVAILABLE_GPU_SLOTS = build_available_gpu_slots(pool_size, cuda)
    pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
    pool.map(work, cmds)

Can I simply declare lock = multiprocessing.Lock() at the same level as AVAILABLE_GPU_SLOTS, put it in cmds, and then inside work() do

with lock: 
    slot = AVAILABLE_GPU_SLOTS.pop() 
# subprocess stuff 
with lock: 
    AVAILABLE_GPU_SLOTS.append(slot) 

or do I need a manager list? Or maybe there is a better solution for what I'm doing.


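Have you thought about creating your own GPU slot pool using a semaphore and a list of available slots? Implement it as a context manager so that you can use 'with gpu_slot_pool.get() as gpu_slot:' –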
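One possible sketch of the context-manager idea in this comment. Instead of a semaphore plus a list, it swaps in a `multiprocessing.Queue`, which does the blocking and the storage in one object. The names `init_worker`, `gpu_slot` and `run_all` are made up for this illustration, and the demo command only prints the environment variable rather than running a real GPU job.

```python
import contextlib
import multiprocessing
import os
import subprocess
import sys

_slot_queue = None  # set in every worker process by init_worker()


def init_worker(slot_queue):
    """Pool initializer: keep the shared queue in a per-process global."""
    global _slot_queue
    _slot_queue = slot_queue


@contextlib.contextmanager
def gpu_slot():
    """Borrow a device id from the shared queue and always hand it back."""
    slot = _slot_queue.get()  # blocks until a slot is free
    try:
        yield slot
    finally:
        _slot_queue.put(slot)  # returned even if the job raises


def work(cmd):
    with gpu_slot() as slot:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(slot))
        return subprocess.call(cmd, shell=False, env=env)


def run_all(cmds, slots):
    """Run every command with one worker per slot; return the exit codes."""
    slot_queue = multiprocessing.Queue()
    for slot in slots:
        slot_queue.put(slot)
    pool = multiprocessing.Pool(processes=len(slots),
                                initializer=init_worker,
                                initargs=(slot_queue,))
    try:
        return pool.map(work, cmds)
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    # Stand-in job: each child just prints the device id it was given.
    demo = [sys.executable, '-c',
            'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])']
    print(run_all([demo] * 4, [0, 1, 0, 1]))
```

Because the queue is handed to the workers through `initargs`, the sketch does not depend on the workers inheriting a module-level global, and the `finally` clause returns the slot even when the subprocess call fails.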


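Oh, and if you are using a Lock, use a 'threading.Lock' object rather than the multiprocessing one. Multiprocessing locks are implemented with named semaphores, which may not be available on all platforms. –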


For now at least, this code will only run on hosts running Ubuntu 16.04, so without really knowing any of the details, I would imagine it is available. – jlerche

Answer


Based on what I found in the following SO answer, Python sharing a lock between processes:

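Using a regular list results in every process having its own copy, as expected. Using a manager list seems to be enough to solve this. Example code: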

import multiprocessing
import time

def doing_work(honk):
    proc = multiprocessing.current_process()
    # Locked variant, using the lock installed by init():
    # with lock:
    #     print('%s about to pop SLOTS_LIST %s' % (proc, SLOTS_LIST))
    #     slot = SLOTS_LIST.pop()
    #     print('%s just popped %s from %s' % (proc, slot, SLOTS_LIST))
    # SLOTS_LIST is inherited from the parent process, which relies on the
    # fork start method (as on Linux).
    print('%s about to pop SLOTS_LIST %s' % (proc, SLOTS_LIST))
    slot = SLOTS_LIST.pop()
    print('%s just popped %s from SLOTS_LIST' % (proc, slot))
    time.sleep(10)

def init(l):
    # Pool initializer: make the shared lock a global in every worker.
    global lock
    lock = l

if __name__ == '__main__':
    man = multiprocessing.Manager()
    SLOTS_LIST = [1, 34, 3465, 456, 4675, 6, 4]
    SLOTS_LIST = man.list(SLOTS_LIST)
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=2, initializer=init, initargs=(l,))
    inputs = range(len(SLOTS_LIST))
    pool.map(doing_work, inputs)

This outputs:

<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456, 4675, 6, 4] 
<Process(PoolWorker-3, started daemon)> just popped 4 from SLOTS_LIST 
<Process(PoolWorker-2, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456, 4675, 6] 
<Process(PoolWorker-2, started daemon)> just popped 6 from SLOTS_LIST 
<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456, 4675] 
<Process(PoolWorker-3, started daemon)> just popped 4675 from SLOTS_LIST 
<Process(PoolWorker-2, started daemon)> about to pop SLOTS_LIST [1, 34, 3465, 456] 
<Process(PoolWorker-2, started daemon)> just popped 456 from SLOTS_LIST 
<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1, 34, 3465]  
<Process(PoolWorker-3, started daemon)> just popped 3465 from SLOTS_LIST 
<Process(PoolWorker-2, started daemon)> about to pop SLOTS_LIST [1, 34] 
<Process(PoolWorker-2, started daemon)> just popped 34 from SLOTS_LIST 
<Process(PoolWorker-3, started daemon)> about to pop SLOTS_LIST [1] 
<Process(PoolWorker-3, started daemon)> just popped 1 from SLOTS_LIST 
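
This is the desired behavior. I'm not sure whether it completely eliminates race conditions, but it seems good enough. Adding a lock on top of it is simple enough as well.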
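A minimal sketch of what the manager list plus lock combination could look like, assuming the pool never has more workers than there are slots (otherwise `pop()` could hit an empty list). The names `_lock`, `_slots` and the `time.sleep` stand-in for the real job are illustrative only. Each single proxy call is carried out inside the manager process, so the lock mainly matters if a worker ever needs to do several list operations as one step.

```python
import multiprocessing
import time

_lock = None   # set in every worker by init()
_slots = None  # manager list proxy, set in every worker by init()


def init(shared_lock, shared_slots):
    """Pool initializer: store the shared lock and slot list as globals."""
    global _lock, _slots
    _lock = shared_lock
    _slots = shared_slots


def doing_work(_):
    with _lock:
        slot = _slots.pop()
    try:
        time.sleep(0.1)  # stand-in for the real job
        return slot
    finally:
        # Give the slot back even if the job raised.
        with _lock:
            _slots.append(slot)


if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_slots = manager.list([0, 1])
    shared_lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=2, initializer=init,
                                initargs=(shared_lock, shared_slots))
    results = pool.map(doing_work, range(6))
    pool.close()
    pool.join()
    print(results)               # one slot id per task
    print(sorted(shared_slots))  # every slot has been returned
```

Passing both objects through `initargs` avoids relying on workers inheriting a global defined under the `__main__` guard.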