One PC with 2 GPUs. I want to train 2 independent CNNs, one on each GPU. I create the graph for each GPU with the code below. How can TensorFlow train 2 CNNs (independently) on 2 GPUs? I get a CUDA_ERROR_OUT_OF_MEMORY error.
with tf.device('/gpu:%d' % self.single_gpu):
    self._create_placeholders()
    self._build_conv_net()
    self._create_cost()
    self._creat_optimizer()
The training loop is not placed under tf.device().
First I start one CNN training process using GPU 1. After that, I start the second CNN training process on GPU 0, but it always fails with CUDA_ERROR_OUT_OF_MEMORY and the second training process cannot start.

Is it possible to run 2 independent training tasks on the same PC, each assigned to its own GPU? If it is possible, what am I missing?
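Running two independent training processes on two GPUs is possible, but by default each TensorFlow process maps every visible GPU and reserves almost all of its memory, which is why the second process dies with CUDA_ERROR_OUT_OF_MEMORY. A minimal sketch of the two usual remedies, assuming a TF 1.x setup like the one in the traceback (the TensorFlow calls are left as comments so the snippet stands on its own):

```python
import os

# Remedy 1: mask the GPUs *before* TensorFlow is imported, so each
# process only sees (and allocates memory on) its own device.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # "1" in the second process

# import tensorflow as tf  # must come after the mask is set

# Remedy 2: stop TensorFlow from reserving all GPU memory up front,
# so two processes sharing one machine do not starve each other.
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
# sess = tf.Session(config=config)
```

With the mask in place, each process behaves as if the machine had a single GPU, and the per-GPU graph placement above becomes unnecessary inside that process.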
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 164.06M (172032000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *******____******************_______________________________________________________________________
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 384.00MiB. See logs for memory state.
Traceback (most recent call last):
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
  [[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]]
  [[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "mg_model_nvidia_gpu.py", line 491, in <module>
    main()
  File "mg_model_nvidia_gpu.py", line 482, in main
    nvidia_cnn.train(data_generator, train_data, val_data)
  File "mg_model_nvidia_gpu.py", line 307, in train
    self.keep_prob: self.train_config.keep_prob})
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
  [[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]]
  [[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
So process 1 uses GPU 0:

    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    with tf.device('/gpu:0'):

and process 2 uses GPU 3:

    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "3"
    with tf.device('/gpu:3'):

Is that what is needed to use the 2 GPUs? – user6101147
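One caveat with the configuration in this comment (my observation, not from the thread): once CUDA_VISIBLE_DEVICES masks a single physical GPU, TensorFlow renumbers the visible devices from zero, so inside process 2 the masked GPU 3 should be addressed as /gpu:0, not /gpu:3. A sketch, with the TensorFlow call left as a comment:

```python
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # process 2 sees only physical GPU 3

# TensorFlow numbers the devices it can see starting at zero, so the
# single remaining GPU is '/gpu:0' inside this process, not '/gpu:3':
device_name = "/gpu:0"
# with tf.device(device_name):
#     build_graph()
```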
What about 1 training process with 2 tasks feeding batches to the 2 GPUs in parallel? The same configuration as above, or not? – user6101147
I think in that case you need to use distributed TensorFlow, so that the training process and the 2 tasks can talk to each other. – npf
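For the single-process, two-task setup this comment refers to, a between-graph replication layout might look like the following sketch (the ports and job layout are hypothetical, and the tf.train calls are commented out so the fragment is self-contained):

```python
# Each worker process runs this file with its own task_index.
cluster = {"worker": ["localhost:2222", "localhost:2223"]}
task_index = 0  # 1 in the second worker process

# cluster_spec = tf.train.ClusterSpec(cluster)
# server = tf.train.Server(cluster_spec, job_name="worker",
#                          task_index=task_index)
# with tf.device("/job:worker/task:%d/gpu:0" % task_index):
#     build_graph()
# A tf.train.MonitoredTrainingSession would then coordinate the tasks.
```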