2016-08-02 26 views
5

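I get this error when using the slurm (http://slurm.schedmd.com/) workload manager. When I run some tensorflow python scripts, they sometimes fail with an error (attached). It seems that it cannot find the installed cuda library, but I am running scripts that do not need a GPU, so I am confused about why cuda would be an issue at all. If I don't need it, why is the cuda installation a problem? And why do jobs in slurm freeze indefinitely when they are TensorFlow scripts?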

The only useful information I get from the SLURM-JOB_ID file is the following:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib 
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 
""" 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine. 

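I always thought tensorflow did not require a GPU. So I assume the last message, which says there is no GPU, is not what causes the failure (correct me if I'm wrong).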

I don't understand why I need the CUDA library. I tried running my jobs with a GPU; if my jobs are CPU jobs, why would I need the cuda library?


I tried logging into the node directly and starting tensorflow, but I got no obvious error:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib 
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 

whereas I expected this error:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib 
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 
""" 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine. 

I also filed an official issue on the TensorFlow GitHub repository:

https://github.com/tensorflow/tensorflow/issues/3632

+1

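To answer "why does this happen?": tensorflow, when run from within the slurm environment, cannot find libcuda.so: 'libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program'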
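One way to check this from inside a job is to ask the dynamic loader directly. The following is only a minimal sketch and not part of TensorFlow: `can_load` and `report` are hypothetical helper names, and the default library list is an assumption taken from the log above. Running it once inside a batch job and once in an interactive session on the same node should show whether the two environments differ.

```python
import ctypes
import os


def can_load(library_name):
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


def report(libraries=("libcuda.so", "libcudnn.so", "libcublas.so")):
    """Collect loader-related facts about the current environment."""
    return {
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "loadable": {name: can_load(name) for name in libraries},
    }


if __name__ == "__main__":
    # Compare this output between the batch job and an interactive shell.
    print(report())
```

A difference in `LD_LIBRARY_PATH` or in the `loadable` entries between the two runs would point at the job environment rather than at TensorFlow itself.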

+0

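@RobertCrovella so the error is not due to 'libcuda reported version is: Not found: was unable to find libcuda.so'. I always thought that if it can't find a GPU, it simply doesn't use one, and that this is fine.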

+0

Filed an official issue on GitHub to see whether someone can help me with this: https://github.com/tensorflow/tensorflow/issues/3632

Answers

1

There is some bug when running tensorflow on slurm through a submitted batch job.

For now I work around it by running with srun on slurm instead.
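If many scripts have to be started this way, the srun calls can be driven from a small launcher. This is just a sketch under assumptions: the script names are hypothetical, `--ntasks=1` is one plausible choice of srun options for a single-process script, and whether each call gets resources immediately depends on the cluster configuration.

```python
import subprocess


def build_srun_command(script, python="python"):
    """Build the argument list for running one script under srun."""
    return ["srun", "--ntasks=1", python, script]


def launch_all(scripts):
    """Start one srun per script, then wait for all of them.

    Returns the exit codes in the same order as the scripts.
    """
    processes = [subprocess.Popen(build_srun_command(s)) for s in scripts]
    return [p.wait() for p in processes]
```

For example, `launch_all(["train_%02d.py" % i for i in range(30)])` would start 30 hypothetical scripts, each under its own srun.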

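It also looks like, in your case, you installed the GPU version of tensorflow and are running it on a machine without a GPU. That is a separate source of errors in your setup.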
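If the GPU build has to stay installed, one option is to hide all GPUs from CUDA for the CPU-only jobs by setting `CUDA_VISIBLE_DEVICES` to an empty string before the script starts. A minimal sketch follows; `cpu_only_env` and `run_cpu_only` are hypothetical helper names. This only makes the CPU-only intent explicit and is not claimed to fix the freeze itself.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value makes no GPU visible to CUDA applications.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_cpu_only(script):
    """Run a Python script in a child process that sees no GPUs."""
    return subprocess.call([sys.executable, script], env=cpu_only_env())
```

The same effect can be had inside a script by setting `os.environ["CUDA_VISIBLE_DEVICES"] = ""` before tensorflow is imported.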

+0

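What do you mean by running srun? Would you mind clarifying that? Unfortunately I need to run about 30 scripts at once, so that won't work for me. I guess I'm stuck with the GPU (it does get stuck there too, but less often).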

+0

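"srun --pty bash" will give you an interactive session. I will post when they fix it or when I find the reason behind it, but all I know is that there is a bug with submitting sbatch jobs. – Steven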

+0

So when you run srun and bash, you then run the jobs and everything works as expected? (Just as an update, it sometimes gets stuck on the GPU too.)

0

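I have been having a similar problem, and I put it down to the saver hanging when writing the model to a Lustre file system. Still waiting for a real solution, though.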
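One way to test whether the shared file system is the culprit is to save to node-local storage first and copy the result over afterwards. The following is only a sketch under assumptions: `save_fn` is a hypothetical callable standing in for whatever writes the checkpoint, and the default `filename` is made up. A real saver may also write index files that record the original path, which this sketch does not adjust.

```python
import os
import shutil
import tempfile


def save_via_local_dir(save_fn, final_dir, filename="model.ckpt"):
    """Write with save_fn into a local temp dir, then copy to final_dir.

    save_fn is called with one argument, the local path to write to.
    Returns the sorted names of the files that were copied.
    """
    os.makedirs(final_dir, exist_ok=True)
    local_dir = tempfile.mkdtemp(prefix="ckpt_")
    try:
        save_fn(os.path.join(local_dir, filename))
        copied = []
        for name in sorted(os.listdir(local_dir)):
            shutil.copy2(os.path.join(local_dir, name),
                         os.path.join(final_dir, name))
            copied.append(name)
        return copied
    finally:
        # Always clean up the local scratch directory.
        shutil.rmtree(local_dir, ignore_errors=True)
```

If the job no longer hangs when saving this way, that would support the file system explanation.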