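Why does TensorFlow run out of memory when restoring a checkpoint, when the original script does not?
I am running some TensorFlow code that restores a checkpoint and resumes training from it. Whenever I restore with the CPU build, it seems to work fine. However, when I run my code on the GPU and try to restore, it does not work. Specifically, I get this error: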
Traceback (most recent call last):
File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module>
large_main_hp.main_large_hp_ckpt(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt
run_hyperparam_search(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search
main_hp.main_hp(arg)
File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp
with tf.Session(graph=graph) as sess:
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
next(self.gen)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
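I can see that it says I am running out of memory, but when I increase the memory to, say, 10 GB, nothing really changes. This only happens with my GPU build; the CPU build restores perfectly.
In any case, does anyone have ideas, or even a starting point, about what might be causing this?
The GPUs are essentially allocated automatically, so I am not sure what could be causing this, or even what the first debugging steps would be.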
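Not sure whether this matters, but I have a for loop in which I build different graphs. So if I am testing, say, 3 models, I train the first one, then the second, then the last. Could that be the cause of the error?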
Very likely. "By default, TensorFlow maps nearly all of the GPU memory", so you need to make sure you configure the session correctly. https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
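Since TensorFlow maps nearly all GPU memory by default, one possible first debugging step is to check how much memory on the device is already in use before the session is created, for example by another TensorFlow process. Below is a minimal sketch, assuming `nvidia-smi` is installed and on the PATH. `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, not part of TensorFlow.

```python
import subprocess

# nvidia-smi query that prints one "used, total" line (in MiB) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines like '123, 11441' into (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_memory(output)
```

If `query_gpu_memory()` reports the device as nearly full before your script starts, something else is probably holding the memory, and that may be worth ruling out before changing the TensorFlow code itself.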
per_process_gpu_memory_fraction is probably what you want: config = tf.ConfigProto(); config.gpu_options.per_process_gpu_memory_fraction = 0.4; session = tf.Session(config=config, ...)
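To make the suggestion above concrete for the loop-over-models case, here is a rough sketch that caps GPU memory per session and keeps only one session open at a time. It assumes the TF 1.x-style API reached through `tf.compat.v1`; on older 1.x releases the same names live directly on `tf`. The helper names `gpu_option_fields`, `make_session`, and `train_sequentially` are made up for illustration, and the 0.4 fraction is just the value from the comment. Whether this resolves the error in the question has not been verified.

```python
from contextlib import closing


def gpu_option_fields(memory_fraction=None, allow_growth=False):
    """Collect GPUOptions fields as a plain dict (hypothetical helper)."""
    fields = {"allow_growth": bool(allow_growth)}
    if memory_fraction is not None:
        if memory_fraction <= 0:
            raise ValueError("memory_fraction must be positive")
        fields["per_process_gpu_memory_fraction"] = float(memory_fraction)
    return fields


def make_session(graph=None, memory_fraction=0.4, allow_growth=False):
    """Create a session whose GPU memory use is capped (TF 1.x-style API)."""
    import tensorflow as tf  # lazy import: the helper above needs no TensorFlow

    tf1 = tf.compat.v1  # assumption: on old 1.x releases use `tf` directly
    gpu_options = tf1.GPUOptions(
        **gpu_option_fields(memory_fraction, allow_growth)
    )
    config = tf1.ConfigProto(gpu_options=gpu_options)
    return tf1.Session(graph=graph, config=config)


def train_sequentially(train_fns, session_factory=make_session):
    """Run each training function in its own session, closing it afterwards."""
    results = []
    for train in train_fns:
        # closing() calls sess.close() even if training raises.
        with closing(session_factory()) as sess:
            results.append(train(sess))
    return results
```

To give each model its own graph, you could pass a factory such as `lambda: make_session(graph=tf.compat.v1.Graph())` and build the model inside `sess.graph.as_default()` in each training function. Setting `allow_growth=True` instead of a fixed fraction is the alternative described at the link in the previous comment.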