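How to safely terminate a TensorFlow program running on multiple GPUs

I have implemented a network using TensorFlow. The network is trained on 4 GPUs. When I press Ctrl+C, the program crashes the NVIDIA driver and leaves zombie processes named "python". I cannot kill the zombie processes, and I cannot reboot the Ubuntu system with sudo reboot either. How can I safely terminate a TensorFlow program running on multiple GPUs?

I am using a FIFO queue and a thread to read data from binary files.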

import threading
import tensorflow as tf

coord = tf.train.Coordinator()
t = threading.Thread(target=load_and_enqueue, args=(sess, enqueue_op, coord))
t.start()

After I call sess.close(), the program does not stop, and I see:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=4033 evicted_count=3000 eviction_rate=0.743863 and unsatisfied allocation rate=0 
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 0 get requests, put_count=14033 evicted_count=13000 eviction_rate=0.926388 and unsatisfied allocation rate=0

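It seems the GPU resources are not released. If I open another terminal, the nvidia-smi command does not respond. I then have to force a hard reboot of the system: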

#echo 1 > /proc/sys/kernel/sysrq 
#echo b > /proc/sysrq-trigger

I know sess.close() may be too brutal, so I tried emptying the FIFO queue with a dequeue operation first:

while iteration < 10000: 
    GPU training... 

#training finished 

coord.request_stop() 
while sess.run(queue_size) > 0: 
    sess.run(dequeue_one_element_op) 
    print('queue_size='+str(sess.run(get_queue_size_op))) 
    time.sleep(1) 
coord.join([t]) 
print('finished join t')

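This approach does not work either. Basically, the program cannot terminate after reaching the maximum number of training iterations.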
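One possible reason the drain loop above never finishes is that the loader thread keeps refilling the queue, or sits blocked on a full queue, without ever noticing the stop request. The following is only a minimal sketch of a cooperative shutdown pattern, written with the Python standard library rather than TensorFlow so that it runs anywhere. The names `run_training`, `stop_event`, and `capacity` are hypothetical, `queue.Queue` stands in for the FIFO queue, and `threading.Event` plays the role of `coord.request_stop()` / `coord.should_stop()`. It does not address the driver crash itself.

```python
import itertools
import queue
import threading


def load_and_enqueue(q, stop_event, source):
    """Producer: enqueue items until the source ends or a stop is requested."""
    for item in source:
        # Use a short timeout so a full queue never blocks this thread forever.
        while not stop_event.is_set():
            try:
                q.put(item, timeout=0.1)
                break
            except queue.Full:
                continue
        if stop_event.is_set():
            return


def run_training(max_iterations=20, capacity=4):
    q = queue.Queue(maxsize=capacity)
    stop_event = threading.Event()
    source = itertools.count()  # endless stand-in for the binary-file reader
    t = threading.Thread(target=load_and_enqueue, args=(q, stop_event, source))
    t.start()
    consumed = 0
    try:
        while consumed < max_iterations:
            q.get(timeout=1.0)  # stand-in for one training step
            consumed += 1
    except KeyboardInterrupt:
        pass  # Ctrl+C: fall through to the orderly shutdown below
    finally:
        stop_event.set()     # ask the producer to stop
        t.join(timeout=5.0)  # wait for it to exit before releasing resources
    return consumed, t.is_alive()
```

In the TensorFlow queue API of that era, the analogous steps would presumably be to have `load_and_enqueue` check `coord.should_stop()` on every iteration and, as far as I know, to close the queue with `cancel_pending_enqueues=True` so that a blocked enqueue is cancelled before `coord.join([t])` and `sess.close()` are called.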

2016-01-21 read Read

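Did you find a solution to this problem? I am not even using a FIFO queue or a separate thread, and I still have this issue. – Adi

@Adi No. I ended up not using multiple GPUs. :( –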

https://github.com/tensorflow/tensorflow/issues/658

This solved the problem:

export CUDA_VISIBLE_DEVICES=0
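For reference, the same restriction can be applied from inside a Python script instead of the shell. This is a small sketch, assuming the variable is set before TensorFlow (or any other CUDA library) initializes the GPUs. The helper name `restrict_visible_gpus` is hypothetical. `CUDA_VISIBLE_DEVICES` accepts a comma-separated list of device indices, so more than one GPU can be listed.

```python
import os


def restrict_visible_gpus(device_ids):
    """Set CUDA_VISIBLE_DEVICES to the given GPU indices and return the value."""
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Same effect as: export CUDA_VISIBLE_DEVICES=0
restrict_visible_gpus([0])

# Import the framework only after the variable is set, for example:
# import tensorflow as tf
```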

2016-03-12 01:25:43

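Actually, this does not solve the problem. Your approach restricts the program to a single GPU, but I want to speed up training with multiple GPUs. –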
