我跑了檢查,看看我的Tensorflow安裝是否正在使用使用示例代碼從Tensorflow說明我的GPU hereTensorflow:不一致的GPU識別
當我運行的代碼,第一次,我得到這個輸出:
$ python gpu-test.py
出來:
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
b: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] b: (Const)/job:localhost/replica:0/task:0/gpu:0
a: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] a: (Const)/job:localhost/replica:0/task:0/gpu:0
[[ 22. 28.]
[ 49. 64.]]
它使用GPU,都好!有了這個確定性,我推出了一個巨型CNN的Jupyter筆記本電腦並且訓練它,而且速度非常慢。
我很困惑,第二次運行gpu-test.py
。這一次,即使在此期間沒有任何變化,我得到不同的輸出:
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-172-31-19-90
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-172-31-19-90
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.39.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0
E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 367.57.0 does not match DSO version 375.39.0 -- cannot find working devices in this configuration
Device mapping: no known devices.
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping:
MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] b: (Const)/job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] a: (Const)/job:localhost/replica:0/task:0/cpu:0
[[ 22. 28.]
[ 49. 64.]]
我現在完全困惑。
在我運行GPU測試第一次和第二次之間發生的唯一兩件事情是(1)我解壓文件和(2)我跑Jupyter筆記本。 沒有任何已安裝,更新或無論如何改變了我的系統。
任何人都可以幫忙嗎?
爲什麼這種情況正在發生突然的,當它沒有發生提前5分鐘:
kernel version 367.57.0 does not match DSO version 375.39.0
我怎樣才能升級內核版本?
我發現出了什麼事:自動更新驅動程序在後臺運行的無人蔘與更新嘗試將驅動程序更新至版本375.39.0。但是,AWS g2.2xlarge實例上的GRID K520 GPU對於此驅動程序版本來說太舊了。出於某種性能上的原因,此嘗試的自動更新會使系統處於不一致的狀態,並且打破了這一切,唯一的辦法就是啓動一個新的AWS實例,並在啓動後立即終止更新過程,以保持系統完好無損,非常令人討厭的問題:/。感謝您的幫助! – Alex