TensorFlow C++ API: loading a model fails with "from device: CUDA_ERROR_OUT_OF_MEMORY"

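My model is about 2.4 GB. At inference time I want to load the model onto each GPU from multiple processes. That is, I create two processes per GPU, and each one loads its own copy of the model. After each session is configured, each one gets about 5 GB of memory, but I still get "from device: CUDA_ERROR_OUT_OF_MEMORY". I would like to know why. Any help is appreciated.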

GPU info:

[search@qrwt01 /home/s/apps/qtfserverd/bin]$ nvidia-smi
Thu Sep 14 21:42:48 2017

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:08:00.0     Off |                    0 |
| N/A   48C    P0    61W / 149W |  11366MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:09:00.0     Off |                    0 |
| N/A   32C    P0    72W / 149W |  11359MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     33056    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5823MiB |
|    0     33057    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5515MiB |
|    1     33058    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5823MiB |
|    1     33059    C   ...ome/s/apps/qtfserverd/etc/qtfserverd.conf  5516MiB |
+-----------------------------------------------------------------------------+

Session configuration:

#include <cstdio>
#include <string>

#include "tensorflow/core/public/session.h"

using namespace tensorflow;  // the snippet uses unqualified TensorFlow names

void* create_session(void* graph, std::string& checkpoint_path,
                     int intra_op_threads, int inter_op_threads,
                     std::string& device_list) {
    Session* session = NULL;
    SessionOptions sess_opts;
    // int NUM_THREADS = 8;

    // Only override the thread-pool sizes when a positive value is passed in.
    if (intra_op_threads > 0) {
        sess_opts.config.set_intra_op_parallelism_threads(intra_op_threads);
    }
    if (inter_op_threads > 0) {
        sess_opts.config.set_inter_op_parallelism_threads(inter_op_threads);
    }

    sess_opts.config.set_allow_soft_placement(true);
    // Restrict this process to the GPUs named in device_list.
    sess_opts.config.mutable_gpu_options()->set_visible_device_list(device_list);
    sess_opts.config.mutable_gpu_options()->set_allocator_type("BFC");
    // Cap this process at half of each visible GPU's memory...
    sess_opts.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.5);
    // ...and grow the allocation on demand instead of reserving it up front.
    sess_opts.config.mutable_gpu_options()->set_allow_growth(true);

    Status status = NewSession(sess_opts, &session);
    if (!status.ok()) {
        fprintf(stderr, "Create Session Failed %s\n", status.ToString().c_str());
        return NULL;
    }
    // ...
}

Error message:

load /home/search/tensorflow/deploy_combine.model.meta graph to /gpu:1 success
2017-09-14 21:42:31.188212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:09:00.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2017-09-14 21:42:31.188260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1055] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: Tesla K80, pci bus id: 0000:09:00.0, compute capability: 3.7)
qss_switch:1, lstm_switch:1
qss_switch:1, lstm_switch:1
2017-09-14 21:42:33.826598: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 1.58G (1701773312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.838694: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 1.43G (1531596032 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.893832: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.903917: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.913843: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.924008: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.935385: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.946556: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 439.82M (461180672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-09-14 21:42:33.956340: E tensorflow/stream_executor/cuda/cuda_driver.

Source

2017-09-15 suzhaolong

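Try reducing the parameters of your operations, or run the computation in batches, since the error indicates that all of the GPU's resources are exhausted.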

Source

2018-03-06 10:58:56
