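Why does TensorFlow run out of memory when restoring a checkpoint, when the original script does not?

I am running some TensorFlow code that restores a checkpoint and resumes training from it. Whenever I restore with the CPU build, it works fine. But if I try to restore while running my code on the GPU, it fails. In particular, I get this error: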

Traceback (most recent call last): 
    File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module> 
    large_main_hp.main_large_hp_ckpt(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt 
    run_hyperparam_search(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search 
    main_hp.main_hp(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp 
    with tf.Session(graph=graph) as sess: 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__ 
    super(Session, self).__init__(target, graph, config=config) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__ 
    self._session = tf_session.TF_NewDeprecatedSession(opts, status) 
    File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Failed to create session. 
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615

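I can see that it says I ran out of memory, but when I increase the memory to, say, 10 GB, nothing really changes. This only happens with my GPU build; restoring on the CPU works perfectly.

Does anyone have ideas, or a starting point, for what might be causing this?

The GPU is essentially assigned to me automatically, so I am not sure what could be causing this, or even what the first debugging steps would be.

Full error: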

E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615 
Traceback (most recent call last): 
    File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module> 
    large_main_hp.main_large_hp_ckpt(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt 
    run_hyperparam_search(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search 
    main_hp.main_hp(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp 
    with tf.Session(graph=graph) as sess: 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__ 
    super(Session, self).__init__(target, graph, config=config) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__ 
    self._session = tf_session.TF_NewDeprecatedSession(opts, status) 
    File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Source

2017-02-12 Charlie Parker

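On the CPU, TensorFlow can draw on both physical and virtual memory, which gives you almost unlimited memory for working with your model; a GPU has only its fixed on-board memory. The first debugging step is to build a smaller model, simply by removing some weights/layers, and run it on the GPU to make sure you have no coding errors. Then slowly add layers/weights back until you run out of memory again. That will confirm that you have a memory problem on the GPU. I recommend building your graph on the GPU from the start, so that you know it will fit when you train it later. If you need a large graph, consider assigning parts of the graph to different GPUs (if you have them).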
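To decide how far to shrink the model before growing it again, a rough size estimate can help. The following is only a sketch under stated assumptions: it counts the weights and biases of a plain fully connected network stored as 32-bit floats, and the layer widths are hypothetical, not taken from the question. Training needs noticeably more than this (activations, gradients, optimizer state), so treat the result as a lower bound.

```python
def dense_param_count(layer_sizes):
    """Count weights plus biases of a fully connected net with these layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def param_megabytes(layer_sizes, bytes_per_value=4):
    """Lower bound on parameter memory in MiB, assuming 4-byte (float32) values."""
    return dense_param_count(layer_sizes) * bytes_per_value / float(2 ** 20)


if __name__ == "__main__":
    # Hypothetical layer widths, for illustration only.
    for sizes in ([784, 256, 10], [784, 1024, 1024, 10]):
        print(sizes, dense_param_count(sizes), "parameters,",
              round(param_megabytes(sizes), 2), "MiB")
```

If the estimate is already close to the memory of the card, the "remove layers, then add them back" procedure above is likely to hit the limit early.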

Source

2017-02-12 02:39:20

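Not sure if this matters, but I have a for loop in which I build different graphs. So I test, say, 3 models: first I train the first one, then the second, then the last. Could that be the cause of the error?

Very likely. "By default, TensorFlow maps nearly all of the GPU memory", so you need to make sure you configure the session correctly. https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth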
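If the loop over several graphs is indeed the cause, one option that is not mentioned in this thread is to train each model in its own Python process, since GPU memory held by a process is returned to the driver when that process exits. The sketch below is an assumption-laden illustration: `WORKER_CODE` is a hypothetical stand-in for a real training script, and the function name is made up for this example.

```python
import subprocess
import sys

# Hypothetical stand-in for a real training script. In practice this would
# import TensorFlow, build one graph, restore its checkpoint and train it.
WORKER_CODE = "import sys; print('trained model', sys.argv[1])"


def train_each_model_in_own_process(model_ids, worker_code=WORKER_CODE):
    """Run one child interpreter per model and collect what each one printed."""
    outputs = []
    for model_id in model_ids:
        completed = subprocess.run(
            [sys.executable, "-c", worker_code, str(model_id)],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,  # raise CalledProcessError if a child fails
        )
        outputs.append(completed.stdout.strip())
    return outputs


if __name__ == "__main__":
    print(train_each_model_in_own_process([0, 1, 2]))
```

The children run one after another, so only one of them holds the GPU at any time.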

per_process_gpu_memory_fraction is probably what you want: config = tf.ConfigProto(); config.gpu_options.per_process_gpu_memory_fraction = 0.4; session = tf.Session(config=config, ...)
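The two settings from these comments can be wrapped in a small helper. This is a minimal sketch, not a definitive recipe: the helper name `make_gpu_session_config` and its argument check are assumptions made for this example, while `allow_growth` and `per_process_gpu_memory_fraction` are the GPU options referred to above. TensorFlow is imported inside the function, and the code falls back to `tf.compat.v1` on TensorFlow 2.x, where the session API lives under that namespace.

```python
def make_gpu_session_config(memory_fraction=None, allow_growth=True):
    """Build a session config that limits how much GPU memory TensorFlow claims.

    memory_fraction: optional cap in (0, 1] on the share of each GPU's memory.
    allow_growth:    if True, allocate GPU memory on demand instead of up front.
    """
    # Restricting the cap to (0, 1] is a choice made for this helper.
    if memory_fraction is not None and not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1], got %r"
                         % (memory_fraction,))

    import tensorflow as tf  # imported lazily; needed only to build the config

    # TF 2.x keeps the 1.x session API under tf.compat.v1.
    tf1 = getattr(getattr(tf, "compat", None), "v1", tf)
    config = tf1.ConfigProto()
    config.gpu_options.allow_growth = allow_growth
    if memory_fraction is not None:
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config
```

With the session call from the traceback, usage would look like `tf.Session(graph=graph, config=make_gpu_session_config(memory_fraction=0.4))`.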
