1

TensorFlow out-of-memory error when running Inception v3 on 4 machines

I am trying to run Inception v3 (https://github.com/tensorflow/models/tree/master/inception) distributed across up to 32 machines.

I am hitting an out-of-memory error when running on 4 machines.

Here is the error:

INFO:tensorflow:Started 0 queues for processing input data. 
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[2048,1001] 
    [[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]] 
    [[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]] 
Traceback (most recent call last): 
    File "imagenet_distributed_train.py", line 65, in <module> 
    tf.app.run() 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run 
    sys.exit(main(sys.argv)) 
    File "imagenet_distributed_train.py", line 61, in main 
    inception_distributed_train.train(server.target, dataset, cluster_spec) 
    File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 286, in train 
    loss_value, step = sess.run([train_op, global_step]) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run 
    run_metadata_ptr) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run 
    feed_dict_string, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run 
    target_list, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call 
    raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[2048,1001] 
    [[Node: gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul = Mul[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/gpu:2"](logits/logits/weights/read_S3003, gradients/logits/logits/weights/Regularizer/L2Regularizer/value_grad/tuple/control_dependency_1)]] 
    [[Node: gradients/AddN_48_S3319 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:2", send_device_incarnation=-546941133885931708, tensor_name="edge_17701_gradients/AddN_48", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/cpu:0"]()]] 
Caused by op u'gradients/logits/logits/weights/Regularizer/L2Regularizer/L2Loss_grad/mul', defined at: 
    File "imagenet_distributed_train.py", line 65, in <module> 
    tf.app.run() 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run 
    sys.exit(main(sys.argv)) 
    File "imagenet_distributed_train.py", line 61, in main 
    inception_distributed_train.train(server.target, dataset, cluster_spec) 
    File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 215, in train 
    grads = opt.compute_gradients(total_loss) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 229, in compute_gradients 
    return self._opt.compute_gradients(*args, **kwargs) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 253, in compute_gradients 
    colocate_gradients_with_ops=colocate_gradients_with_ops) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 478, in gradients 
    in_grads = _AsList(grad_fn(op, *out_grads)) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 402, in _L2LossGrad 
    return op.inputs[0] * grad 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 754, in binary_op_wrapper 
    return func(x, y, name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 903, in _mul_dispatch 
    return gen_math_ops.mul(x, y, name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1427, in mul 
    result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op 
    op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op 
    original_op=self._default_original_op, op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__ 
    self._traceback = _extract_stack() 

...which was originally created as op u'logits/logits/weights/Regularizer/L2Regularizer/L2Loss', defined at: 
    File "imagenet_distributed_train.py", line 65, in <module> 
    tf.app.run() 
[elided 1 identical lines from previous traceback] 
    File "imagenet_distributed_train.py", line 61, in main 
    inception_distributed_train.train(server.target, dataset, cluster_spec) 
    File "/home/ubuntu/indu/models/inception/inception/inception_distributed_train.py", line 154, in train 
    logits = inception.inference(images, num_classes, for_training=True) 
    File "/home/ubuntu/indu/models/inception/inception/inception_model.py", line 87, in inference 
    scope=scope) 
    File "/home/ubuntu/indu/models/inception/inception/slim/inception_model.py", line 326, in inception_v3 
    restore=restore_logits) 
    File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args 
    return func(*args, **current_args) 
    File "/home/ubuntu/indu/models/inception/inception/slim/ops.py", line 300, in fc 
    restore=restore) 
    File "/home/ubuntu/indu/models/inception/inception/slim/scopes.py", line 155, in func_with_args 
    return func(*args, **current_args) 
    File "/home/ubuntu/indu/models/inception/inception/slim/variables.py", line 290, in variable 
    trainable=trainable, collections=collections) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 830, in get_variable 
    custom_getter=custom_getter) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 673, in get_variable 
    custom_getter=custom_getter) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 217, in get_variable 
    validate_shape=validate_shape) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 202, in _true_getter 
    caching_device=caching_device, validate_shape=validate_shape) 

I am using EC2 g2.8xlarge instances. These instances have:

  1. Intel Xeon E5-2670 (Sandy Bridge) processors
  2. 60 GB of memory
  3. Four GK104GL [GRID K520] GPUs, each with 4 GB of memory
  4. 10 Gigabit networking

These machines run Ubuntu 14.04.4 LTS.

I run one worker per GPU, so 16 workers in total.

I run one PS per machine, so 4 PS tasks in total.

I am using a batch size of 8. (4 machines run out of memory with a batch size of 8; 32 machines run out of memory even with a batch size of 2.)
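
For reference, the --ps_hosts and --worker_hosts flags in the launch commands further below describe a cluster that corresponds roughly to the following sketch (the actual spec is built inside imagenet_distributed_train.py from those flags):

import tensorflow as tf

# One PS task per machine and one worker task per GPU (four per machine),
# matching the host:port lists passed on the command lines below.
cluster = tf.train.ClusterSpec({
    "ps": ["worker1:2222", "worker2:2222", "worker3:2222", "worker4:2222"],
    "worker": ["worker%d:%d" % (m, 2230 + g)
               for m in range(1, 5) for g in range(4)],
})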

Installed CUDA and cuDNN versions:

$ ls -l /usr/local/cuda/lib64/libcud* 
-rw-r--r-- 1 root root 322936 Aug 15 2015 /usr/local/cuda/lib64/libcudadevrt.a 
lrwxrwxrwx 1 root root 16 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5 
lrwxrwxrwx 1 root root 19 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18 
-rwxr-xr-x 1 root root 383336 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5.18 
-rw-r--r-- 1 root root 720192 Aug 15 2015 /usr/local/cuda/lib64/libcudart_static.a 

I installed TensorFlow from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl

$ python -c "import tensorflow; print(tensorflow.__version__)" 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
0.10.0rc0 

Could someone please help me figure out how to fix this problem and run Inception v3 on a cluster of 32 machines?

More information: here are the commands I run on the machines in the cluster:

On machine1: 
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=0 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=0 > /tmp/worker0 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=1 > /tmp/worker1 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=2 > /tmp/worker2 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=3 > /tmp/worker3 2>&1 & 


On machine2: 
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=1 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=4 > /tmp/worker4 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=5 > /tmp/worker5 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=6 > /tmp/worker6 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=7 > /tmp/worker7 2>&1 & 


On machine3: 
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=2 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=8 > /tmp/worker8 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=9 > /tmp/worker9 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=10 > /tmp/worker10 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=11 > /tmp/worker11 2>&1 & 


On machine4: 
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=3 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=12 > /tmp/worker12 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=13 > /tmp/worker13 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=14 > /tmp/worker14 2>&1 & 
python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=15 > /tmp/worker15 2>&1 & 

Update 1:

I tried the following experiments:

Experiment 1:

  • Worker1, worker2, worker3, and worker4 on machine1
  • ps1 on machine1, ps2 on machine2, ps3 on machine3, ps4 on machine4.

This is the same as the 4-machine configuration, except that the workers on 3 of the 4 machines are removed. The worker load on machine1 stays the same, and the communication load on machine1 (talking to all four PS tasks) stays the same. I expected this to run out of memory, but it worked fine.

Experiment 2:

  • Worker1, worker2, worker3, and worker4 on machine1.
  • ps1 (the only PS) on machine2.

This worked like a charm and learned at a faster rate than Experiment 1.

Given this, I don't understand why the 4-machine setup, which uses all four GPUs on each machine, runs out of memory.

+1

One suggestion: try setting CUDA_VISIBLE_DEVICES=0 for the first task on each machine, CUDA_VISIBLE_DEVICES=1 for the second task, and so on. This will change the GPU naming (each worker task will have a single GPU device called '/gpu:0', corresponding to its single visible device), but it should prevent the different TensorFlow processes from interfering with each other. – mrry

+0

@mrry: that fixed the problem. Thanks! Previously I had changed tf.device('/job:worker/task:%d' % FLAGS.task_id) to tf.device('/gpu:%d' % gpunum) in inception_distributed_train.py, where gpunum = FLAGS.task_id % 4. I'm not sure why that is not the same as using CUDA_VISIBLE_DEVICES=gpunum. –
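
(A minimal sketch of the difference, assuming the key point is when the GPUs become visible to the process; the task-to-GPU mapping follows the gpunum = FLAGS.task_id % 4 idea from the comment above, with an illustrative task_id.)

import os

task_id = 3            # illustrative; in the scripts this comes from --task_id
gpunum = task_id % 4   # one GPU per worker task, as in the comment above

# Hiding the other GPUs must happen before tf.train.Server() initializes the
# GPU devices; afterwards this process sees a single GPU named '/gpu:0'.
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpunum)

# By contrast, tf.device('/gpu:%d' % gpunum) only controls where ops are
# placed; the process still creates a device object for (and by default
# reserves most of the memory on) every GPU it can see, so four worker
# processes on one machine can still collide on the same physical GPUs.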

+0

I added some theories to my answer... I think there are problems when multiple TensorFlow processes share the same physical device, but I'm not 100% sure that is what causes the failure. – mrry

Answers

2

As discussed in the comments, setting CUDA_VISIBLE_DEVICES=i for the i-th task on each machine solved the problem. This has the effect of changing the GPU naming (so each worker task has a single GPU device named "/gpu:0", corresponding to the single visible device in that task), but it prevents the different TensorFlow processes on the same machine from interfering with each other.

The following commands should work:

# On machine1: 
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=0 2>&1 & 
CUDA_VISIBLE_DEVICES=0 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=0 > /tmp/worker0 2>&1 & 
CUDA_VISIBLE_DEVICES=1 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=1 > /tmp/worker1 2>&1 & 
CUDA_VISIBLE_DEVICES=2 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=2 > /tmp/worker2 2>&1 & 
CUDA_VISIBLE_DEVICES=3 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=3 > /tmp/worker3 2>&1 & 


# On machine2: 
CUDA_VISIBLE_DEVICES='' python imagenet_distributed_train.py --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=ps --task_id=1 2>&1 & 
CUDA_VISIBLE_DEVICES=0 python imagenet_distributed_train.py --batch_size=8 --data_dir=datadir --ps_hosts=worker1:2222,worker2:2222,worker3:2222,worker4:2222 --worker_hosts=worker1:2230,worker1:2231,worker1:2232,worker1:2233,worker2:2230,worker2:2231,worker2:2232,worker2:2233,worker3:2230,worker3:2231,worker3:2232,worker3:2233,worker4:2230,worker4:2231,worker4:2232,worker4:2233 --job_name=worker --task_id=4 > /tmp/worker4 2>&1 & 
... 

The exact cause is not completely clear, but there are two possibilities:

  1. In your initial setup, all four worker tasks on each machine create a device object for every GPU on the machine, and they may try to allocate 4x the memory on every device.
  2. When all four GPUs on the machine are visible to every process, TensorFlow's placer has more choices, and depending on your setup/training program it may inadvertently place ops from two worker tasks on the same GPU.
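
(Not part of the answer, just an illustrative way to test possibility 1: if each worker process really tries to claim most of every visible GPU, capping the per-process allocation should change the behaviour. The fraction and the way the config is wired in below are assumptions.)

import tensorflow as tf

# Cap each worker process at roughly a quarter of each visible GPU, so four
# worker processes on one machine cannot each try to take a whole 4 GB card.
sess_config = tf.ConfigProto(
    allow_soft_placement=True,
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.24))

# This config would be passed wherever the training script creates its
# session against server.target.
sess = tf.Session(config=sess_config)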

2

For a model that was tuned to fit on GPU cards with 12 GB of GPU memory, 4 GB of GPU memory is a bit low. A small batch size reduces the activation sizes, but it does not reduce the parameter sizes.

Once you have made sure there is no unnecessary memory usage in your model, you could try disabling the cuDNN convolution scratch memory by setting

TF_CUDNN_WORKSPACE_LIMIT_IN_MB=0

to disable the use of scratch memory in the model. Your model will run a bit slower, but hopefully it will at least be able to run.
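
(A sketch of one way to set it; since it is an environment variable read by the TensorFlow runtime, it can also simply be prefixed onto the worker launch commands, like CUDA_VISIBLE_DEVICES above.)

import os

# Must be set in the worker process's environment before any conv ops run;
# '0' disables cuDNN's scratch workspace at some cost in speed.
os.environ['TF_CUDNN_WORKSPACE_LIMIT_IN_MB'] = '0'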

+0

Thanks for your answer. Could you take a look at 'Update 1' in the question? If the large number of parameters were the problem, wouldn't those experiments also have run out of memory? Given that, should I still try setting TF_CUDNN_WORKSPACE_LIMIT_IN_MB to 0? If so, is there a way to do this without compiling from source? –

+0

TF_CUDNN_WORKSPACE_LIMIT_IN_MB is an environment variable, so you don't have to compile from source. I agree there is no guarantee that it will solve the problem, but it's a good idea to try it anyway. If you are only slightly over the memory limit, it might work. – zhengxq
