2017-07-30

Tensorflow: how to train 2 CNNs (independently) on 2 GPUs? CUDA_ERROR_OUT_OF_MEMORY error

I have 1 PC with 2 GPUs and I am training 2 independent CNNs, one on each GPU. I create the graph for a GPU with:

with tf.device('/gpu:%d' % self.single_gpu): 
    self._create_placeholders() 
    self._build_conv_net() 
    self._create_cost() 
    self._creat_optimizer() 

The training loop is not under tf.device().

I start the first CNN training process using GPU 1. After that, I start the second CNN training with GPU 0. I always get a CUDA_ERROR_OUT_OF_MEMORY error and cannot start the second training process.

Is it possible to run 2 separate training tasks assigned to the 2 GPUs on the same PC? If it is possible, what am I missing?

E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 164.06M (172032000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY 

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *******____******************_______________________________________________________________________ 
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 384.00MiB.  See logs for memory state. 
Traceback (most recent call last): 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call 
    return fn(*args) 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn 
    status, run_metadata) 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/contextlib.py", line 89, in __exit__ 
    next(self.gen) 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized. 
    [[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]] 
    [[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "mg_model_nvidia_gpu.py", line 491, in <module> 
    main() 
    File "mg_model_nvidia_gpu.py", line 482, in main 
    nvidia_cnn.train(data_generator, train_data, val_data) 
    File "mg_model_nvidia_gpu.py", line 307, in train 
    self.keep_prob: self.train_config.keep_prob}) 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 767, in run 
    run_metadata_ptr) 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 965, in _run 
    feed_dict_string, options, run_metadata) 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run 
    target_list, options, run_metadata) 
    File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call 
    raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized. 
    [[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]] 
    [[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] 

Answer

By default, TensorFlow pre-allocates the entire memory of the GPU devices it has access to, so no memory is left for the second process.

You can control this allocation with config.gpu_options:

config = tf.ConfigProto() 
config.gpu_options.per_process_gpu_memory_fraction = 0.4 
sess = tf.Session(config=config) 
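An alternative to a fixed fraction is the allow_growth option, which makes TensorFlow start with a small allocation and grow it on demand instead of grabbing the whole card up front. A minimal sketch against the TF 1.x API used in this question (session config only; the training code itself is unchanged):

```python
import tensorflow as tf

config = tf.ConfigProto()
# Option A: hard cap. Each process may use at most 40% of each visible GPU.
config.gpu_options.per_process_gpu_memory_fraction = 0.4
# Option B: allocate lazily instead of pre-allocating the whole device.
# Note: once grown, memory is not released back until the process exits.
config.gpu_options.allow_growth = True

sess = tf.Session(config=config)
```

Either option alone is usually enough; combining them caps the growth at the given fraction.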

Alternatively, you can give each of your two processes a different card by setting os.environ["CUDA_VISIBLE_DEVICES"].
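The masking approach can be sketched as a small helper (a minimal sketch; gpu_env is a hypothetical name, not a TensorFlow API). One caveat worth knowing: once CUDA_VISIBLE_DEVICES hides all but one card, that card is always addressed as /gpu:0 inside the process, whatever its physical index:

```python
import os

def gpu_env(physical_gpu):
    """Environment overrides that pin a process to one physical GPU.

    These must take effect before TensorFlow (or any CUDA library)
    initializes, i.e. before `import tensorflow` runs in the process.
    """
    return {
        "CUDA_DEVICE_ORDER": "PCI_BUS_ID",          # number GPUs by bus id, not by speed
        "CUDA_VISIBLE_DEVICES": str(physical_gpu),  # only this GPU is visible
    }

# Process 1 would apply gpu_env(0), process 2 gpu_env(1), before importing TF.
# Inside each process the lone visible GPU is then '/gpu:0'.
os.environ.update(gpu_env(0))
```

With this in place neither process can see (or pre-allocate) the other's card, so no memory-fraction tuning is needed.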


So process 1 uses GPU 0:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
tf.device('/gpu:0'):
and process 2 uses GPU 3:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
tf.device('/gpu:3'):
Is this what is needed to use the 2 GPUs? – user6101147


What about 1 training process with 2 tasks feeding batches in parallel to the 2 GPUs? Same configuration as above, or not? – user6101147


I think in that case you need to use distributed TensorFlow, so that the training process and the 2 tasks talk to each other. – npf
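For reference, a distributed TF 1.x setup starts from a cluster spec. A minimal sketch of a single-machine, two-worker cluster (one worker per GPU; the addresses and task_index value are illustrative, and each worker process would run this with its own task_index):

```python
import tensorflow as tf

# Hypothetical two-worker cluster on one machine, one worker per GPU.
cluster = tf.train.ClusterSpec({
    "worker": ["localhost:2222", "localhost:2223"],
})

# Each process starts one server; task_index selects which worker it is.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Ops pinned to this worker's GPU via the cluster-wide device name.
with tf.device("/job:worker/task:0/gpu:0"):
    pass  # build this worker's part of the graph here
```

This is only the wiring; between-graph replication, variable placement, and session targets are separate design choices on top of it.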
