2017-02-12 125 views

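Why does TensorFlow run out of memory when restoring a checkpoint, when the original script does not?

I am running some TensorFlow code that restores a checkpoint and resumes training from it. Whenever I restore with the CPU build, it seems to work fine. However, when I try to restore while running my code on the GPU, it fails. In particular, I get this error: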

Traceback (most recent call last): 
    File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module> 
    large_main_hp.main_large_hp_ckpt(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt 
    run_hyperparam_search(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search 
    main_hp.main_hp(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp 
    with tf.Session(graph=graph) as sess: 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__ 
    super(Session, self).__init__(target, graph, config=config) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__ 
    self._session = tf_session.TF_NewDeprecatedSession(opts, status) 
    File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Failed to create session. 
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615 

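I see that it says I am running out of memory, but when I increase the memory to, say, 10 GB, nothing really changes. This only happens with my GPU build; the CPU build restores perfectly.

Does anyone have ideas, or at least a starting point, on what might be causing this?

The GPUs are essentially allocated automatically, so I am not sure what could be causing this, or even what the first debugging steps would be.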


Full error:

E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615 
Traceback (most recent call last): 
    File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module> 
    large_main_hp.main_large_hp_ckpt(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt 
    run_hyperparam_search(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search 
    main_hp.main_hp(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp 
    with tf.Session(graph=graph) as sess: 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__ 
    super(Session, self).__init__(target, graph, config=config) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__ 
    self._session = tf_session.TF_NewDeprecatedSession(opts, status) 
    File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Failed to create session. 

Answers

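On the CPU, TensorFlow can draw on both physical and virtual memory, which gives you practically unlimited memory for your model. On the GPU you are limited to the device's own memory. The first debugging step is to build a smaller model by simply removing some weights/layers and run it on the GPU, to make sure you have no coding errors. Then slowly add layers/weights back until you run out of memory again. That will confirm that you have a memory problem on the GPU. I recommend building your graph on the GPU from the start, so that you know it will fit when you train it later. If you need a large graph, consider assigning parts of the graph to different GPUs, if you have more than one.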
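The last suggestion, placing parts of a graph on different GPUs, can be sketched roughly as follows. This is only an illustration under assumptions: the tiny graph and the names `tf1`, `a`, `h`, and `out` are made up here, the device strings assume two GPUs, and `allow_soft_placement=True` is set so that TensorFlow falls back to an available device when a requested one does not exist.

```python
import tensorflow as tf

# Assumption: TensorFlow 1.x-style sessions. On TensorFlow 2.x the same
# names are available under tf.compat.v1, so use that when present.
tf1 = getattr(tf.compat, "v1", tf)

graph = tf1.Graph()
with graph.as_default():
    # First part of the (hypothetical) model is pinned to the first GPU.
    with tf1.device("/gpu:0"):
        a = tf1.constant([[1.0, 2.0], [3.0, 4.0]])
        h = tf1.matmul(a, a)
    # Second part is pinned to the second GPU.
    with tf1.device("/gpu:1"):
        out = tf1.reduce_sum(h)

# Soft placement lets ops fall back to another device if the requested
# GPU is not present on this machine.
config = tf1.ConfigProto(allow_soft_placement=True)

with tf1.Session(graph=graph, config=config) as sess:
    result = sess.run(out)
```

Adding `log_device_placement=True` to the `ConfigProto` prints where each op actually ran, which can help when checking that the split took effect.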

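Not sure if this matters, but I have a for loop in which I build different graphs. So if I am testing, say, 3 models, I first train the first one, then the second, then the last. Could that be the cause of the error?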
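For reference, a loop of the kind described in this comment might look like the sketch below, with each model built in its own `tf.Graph` and run in its own session that is closed before the next model starts. The function `train_one_model` and its toy one-layer model are hypothetical stand-ins, not the asker's code, and this pattern alone is not claimed to fix the error above; TensorFlow may keep GPU memory it has already claimed for the lifetime of the process.

```python
import tensorflow as tf

# Assumption: TensorFlow 1.x-style sessions. On TensorFlow 2.x the same
# names are available under tf.compat.v1, so use that when present.
tf1 = getattr(tf.compat, "v1", tf)


def train_one_model(num_units):
    """Hypothetical stand-in for training one hyperparameter setting."""
    # A fresh graph per model keeps variables from earlier iterations
    # out of the current one.
    graph = tf1.Graph()
    with graph.as_default():
        x = tf1.placeholder(tf.float32, shape=[None, 4])
        w = tf1.get_variable("w", shape=[4, num_units])
        y = tf1.matmul(x, w)
        init = tf1.global_variables_initializer()

    # The with-block closes the session when this model is done.
    with tf1.Session(graph=graph) as sess:
        sess.run(init)
        out = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})
    return out.shape


# Three models, trained one after another as in the comment.
shapes = [train_one_model(n) for n in (2, 3, 5)]
```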

Quite possibly. "By default, TensorFlow maps nearly all of the GPU memory", so you need to make sure you configure the session correctly. https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
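The memory-growth option that the linked page describes can be set roughly like this. It is a minimal sketch assuming a TensorFlow 1.x-style session; the small graph and the names `tf1`, `total`, and `result` exist only to make the example self-contained.

```python
import tensorflow as tf

# Assumption: TensorFlow 1.x-style sessions. On TensorFlow 2.x the same
# names are available under tf.compat.v1, so use that when present.
tf1 = getattr(tf.compat, "v1", tf)

config = tf1.ConfigProto()
# Start with a small GPU allocation and grow it as needed, instead of
# mapping nearly all GPU memory up front.
config.gpu_options.allow_growth = True

graph = tf1.Graph()
with graph.as_default():
    total = tf1.reduce_sum(tf1.constant([1.0, 2.0, 3.0]))

# The config only takes effect if it is passed to the session.
with tf1.Session(graph=graph, config=config) as sess:
    result = sess.run(total)
```

Note that the traceback in the question shows the session being created as `tf.Session(graph=graph)`, with no `config` argument, so any such option would have to be added at that call.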

per_process_gpu_memory_fraction is probably what you want:

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4
    session = tf.Session(config=config, ...)
