2016-10-04

OOM error running the resnet model in TensorFlow

I am running the resnet model from https://github.com/tensorflow/models/blob/master/resnet/resnet_main.py on an EC2 g2 (NVIDIA GRID K520) instance and seeing an OOM error. I have tried various combinations of removing the GPU-specific code, prefixing the command with CUDA_VISIBLE_DEVICES='0', and reducing batch_size to 64, but I still cannot start training. Can you help me?

W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************************xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
Traceback (most recent call last):
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 82, in train
    feed_dict={model.lrn_rate: lrn_rate})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 64, in train
    model.build_graph()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
    self._build_model()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
    x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
    x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
    return tf.nn.conv2d(x, kernel, strides, padding='SAME')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
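As a sanity check, the 196.00MiB that the allocator failed to find is exactly the size of the tensor named in the error: a float32 tensor of shape [64, 16, 224, 224]. Plain Python, no TensorFlow needed:

```python
# Size of the tensor TensorFlow failed to allocate:
# shape [64, 16, 224, 224], dtype float32 (4 bytes per element).
batch, channels, height, width = 64, 16, 224, 224
bytes_needed = batch * channels * height * width * 4
mib = bytes_needed / (1024 ** 2)
print(f"{mib:.2f} MiB")  # → 196.00 MiB, matching the OOM message
```

This is a single activation tensor for one layer at batch size 64; during training, every layer's activations (plus their gradients) are resident at once, which is why the 8GB card runs out.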


Apparently it is still using the GPU: "use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"". Can you run with a batch_size of 1? This model is very large and consumes a lot of memory. Can you check how much memory that GPU has? You could also set the flag num_gpus to 0 to run on the CPU. –
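One way to force a CPU-only run, independent of any script flag, is to hide the GPUs from CUDA before TensorFlow initializes. This is a minimal sketch using the standard CUDA_VISIBLE_DEVICES mechanism; whether resnet_main.py also exposes a num_gpus flag is per the comment above:

```python
import os

# An empty value means CUDA exposes no devices, so TensorFlow falls
# back to the CPU. This must be set before TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow as tf  # must come *after* the environment change

print(os.environ["CUDA_VISIBLE_DEVICES"])  # → "" (no GPUs visible)
```

Equivalently, prefix the command in the shell: `CUDA_VISIBLE_DEVICES= python resnet_main.py ...` (note: setting it to '0' as in the question makes GPU 0 visible, it does not disable the GPU).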

Answer


The NVIDIA GRID K520 has 8GB of memory (link). I have successfully trained ResNet models on NVIDIA GPUs with 12GB of memory. As the error indicates, TensorFlow tries to fit all of the network's weights into GPU memory and fails. I believe you have a few options:

  • Train on the CPU only, as the comment suggests, assuming your CPU has more than 8GB of memory. This is not recommended.
  • Train a different network with fewer parameters. Several networks have been published since ResNet, such as Inception-v4 and Inception-ResNet, with fewer parameters and comparable accuracy. This option costs nothing to try!
  • Buy a GPU with more memory. The simplest option, if you have the money.
  • Buy a second GPU with the same amount of memory and place the lower half of the network on one and the upper half on the other. The communication required between the GPUs makes this option less than ideal.
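The last option is model parallelism: the layers are split across two devices, and the activation crossing the split point is the inter-GPU transfer mentioned above. A toy, framework-free sketch of the idea (the device names in the comments are illustrative, not TensorFlow API):

```python
# Conceptual sketch of model parallelism: a stack of layers is split in
# two; each half would live on its own GPU, and the activation at the
# boundary would be copied between devices.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def run_split(x, layers, split_at):
    # First half of the network ("gpu:0").
    for layer in layers[:split_at]:
        x = layer(x)
    # ...activation would be copied gpu:0 -> gpu:1 here (the costly step)...
    # Second half of the network ("gpu:1").
    for layer in layers[split_at:]:
        x = layer(x)
    return x

print(run_split(5, layers, 2))  # → 81, i.e. ((5 + 1) * 2 - 3) ** 2
```

In TensorFlow this placement would be expressed with device scopes around each half of the graph; the split only helps here because each GPU then holds only its half of the weights and activations.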

I hope this helps you and others who hit similar memory problems.


He could also reduce the batch size, couldn't he? –
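Reducing the batch size helps because activation memory scales linearly with it. A rough estimator, using as the per-image footprint just the one tensor from the error (16 × 224 × 224 float32 ≈ 3.06 MiB per image; the real network holds many such tensors, so actual usage is far higher):

```python
def activation_mib(batch_size, per_image_mib=3.0625):
    # 3.0625 MiB = 16 * 224 * 224 elements * 4 bytes, for one layer's
    # activation of one image. Total usage grows linearly in batch size.
    return batch_size * per_image_mib

print(activation_mib(64))  # → 196.0, the allocation that failed
print(activation_mib(8))   # → 24.5, the same tensor at batch size 8
```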