2017-03-25 49 views

TensorFlow: inconsistent GPU recognition

I ran a check to see whether my TensorFlow installation is using my GPU, using the example code from the TensorFlow instructions.

When I ran the code the first time, I got this output:

$ python gpu-test.py 

Out:

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GRID K520 
major: 3 minor: 0 memoryClockRate (GHz) 0.797 
pciBusID 0000:00:03.0 
Total memory: 3.94GiB 
Free memory: 3.91GiB 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) 
Device mapping: 
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0 
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping: 
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0 

MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0 
I tensorflow/core/common_runtime/simple_placer.cc:827] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0 
b: (Const): /job:localhost/replica:0/task:0/gpu:0 
I tensorflow/core/common_runtime/simple_placer.cc:827] b: (Const)/job:localhost/replica:0/task:0/gpu:0 
a: (Const): /job:localhost/replica:0/task:0/gpu:0 
I tensorflow/core/common_runtime/simple_placer.cc:827] a: (Const)/job:localhost/replica:0/task:0/gpu:0 
[[ 22. 28.] 
[ 49. 64.]] 

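It is using the GPU, all good! With that certainty, I launched a Jupyter notebook with a giant CNN and trained it, and it was very slow.

Confused, I ran gpu-test.py a second time. This time, even though nothing had changed in the meantime, I got different output: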

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally 
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-172-31-19-90 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-172-31-19-90 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.39.0 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016 
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) 
""" 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0 
E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 367.57.0 does not match DSO version 375.39.0 -- cannot find working devices in this configuration 
Device mapping: no known devices. 
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping: 

MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0 
I tensorflow/core/common_runtime/simple_placer.cc:827] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0 
b: (Const): /job:localhost/replica:0/task:0/cpu:0 
I tensorflow/core/common_runtime/simple_placer.cc:827] b: (Const)/job:localhost/replica:0/task:0/cpu:0 
a: (Const): /job:localhost/replica:0/task:0/cpu:0 
I tensorflow/core/common_runtime/simple_placer.cc:827] a: (Const)/job:localhost/replica:0/task:0/cpu:0 
[[ 22. 28.] 
[ 49. 64.]] 
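
A silent fallback like this is easy to miss, because the script still prints the correct matrix product. A minimal sketch for spotting it from the log text alone, assuming the placement lines keep the `name: (Op): /job:.../cpu:0` format seen in the two runs above; the function name `placement_summary` is hypothetical:

```python
import re
from collections import Counter

# Matches placement lines such as
#   "MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0"
# The pattern is an assumption derived from the two logs in this post.
# Lines prefixed with "I tensorflow/..." repeat the same placement and are
# deliberately not matched, so each op is counted once.
PLACEMENT = re.compile(r"^(\S+): \((\w+)\): /job:\S*/(cpu|gpu):(\d+)$")


def placement_summary(log_text):
    """Count how many ops were placed on each device type ('cpu' or 'gpu')."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PLACEMENT.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


# Placement lines copied from the second run.
sample = """\
MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
"""

counts = placement_summary(sample)
print(dict(counts))  # {'cpu': 3}
if counts["gpu"] == 0:
    print("No op was placed on a GPU; TensorFlow fell back to the CPU.")
```

Running a check like this on the output of gpu-test.py right before a long training job would have flagged the CPU fallback early.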

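I am now completely confused.

The only two things that happened between the first and second runs of the GPU test were that (1) I unzipped a file and (2) I ran a Jupyter notebook. Nothing was installed, updated, or otherwise changed on my system.

Can anyone help?

Why is this suddenly happening when it was not happening 5 minutes earlier: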

kernel version 367.57.0 does not match DSO version 375.39.0 

How can I upgrade the kernel version?

Answers


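I found out what was going on: an automatic driver update, running in the background as an unattended update, tried to update the driver to version 375.39.0.

However, the GRID K520 GPU on the AWS g2.2xlarge instance is too old for this driver version.

The attempted auto-update leaves the system in an inconsistent state and breaks everything.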
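
The inconsistent state is visible in the second log as two different version numbers: the loaded kernel module reports 367.57.0, while libcuda (the DSO) reports 375.39.0. A minimal sketch that extracts both from a saved log, assuming the diagnostic lines keep the wording shown in the question; the function names are hypothetical:

```python
import re


def driver_versions(log_text):
    """Return (kernel_module_version, libcuda_version) parsed from a
    TensorFlow CUDA diagnostics log. A version that is not present in the
    text is returned as None."""
    kernel = re.search(r"kernel reported version is: ([\d.]+)", log_text)
    dso = re.search(r"libcuda reported version is: ([\d.]+)", log_text)
    return (
        kernel.group(1) if kernel else None,
        dso.group(1) if dso else None,
    )


def versions_match(log_text):
    """True only when both versions were found and they are identical."""
    kernel, dso = driver_versions(log_text)
    return kernel is not None and kernel == dso


# Two diagnostic lines copied from the failing run in the question.
sample = """\
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.39.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0
"""

print(driver_versions(sample))  # ('367.57.0', '375.39.0')
print(versions_match(sample))   # False
```

A mismatch here means the user-space library was replaced while the older kernel module was still loaded, which matches a driver update that only partly went through.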

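The only way out for me was to launch a new AWS instance and kill the update process right after launch, so that the system stays untouched. Very nasty problem :/.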

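In case anyone happens to have the same problem:

  • Start a new AWS G2 instance.
  • SSH into it right away.
  • Type top into a terminal to display the running processes, and check whether there is a busy process called "unattended....".
  • If there is, copy its PID (process ID).
  • Kill it with kill -9 PID before it can attempt to install the update.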
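
The lookup step in the list above can also be scripted. This is a rough sketch rather than a tested recipe for the AWS image: it assumes a Linux `ps` that accepts `-e -o pid= -o args=`, and the helper names and the sample process row are illustrative, not taken from the original system.

```python
import subprocess


def list_processes():
    """Return one 'PID command-line' row per running process.
    The trailing '=' after pid and args suppresses the header row."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def matching_pids(ps_output, needle="unattended"):
    """Return the PIDs of all rows whose command line contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)  # split into PID and command line
        if len(parts) == 2 and parts[0].isdigit() and needle in parts[1]:
            pids.append(int(parts[0]))
    return pids


# Illustrative rows only; the real process name on a given image may differ.
sample = """\
    1 /sbin/init
 1234 /usr/bin/python3 /usr/bin/unattended-upgrade
 2001 sshd: ubuntu@pts/0
"""

print(matching_pids(sample))  # [1234]
```

On a live instance, `matching_pids(list_processes())` gives the PIDs to pass to kill -9; from Python the equivalent is `os.kill(pid, signal.SIGKILL)`, which typically needs root for a system process.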
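  • This means you need to update your CUDA driver to the latest version. Not sure where the inconsistency might come from.

  • I found out what was going on: an automatic driver update, running in the background as an unattended update, tried to update the driver to version 375.39.0. However, the GRID K520 GPU on the AWS g2.2xlarge instance is too old for this driver version. The attempted auto-update leaves the system in an inconsistent state and breaks everything. The only way out was to launch a new AWS instance and kill the update process right after launch to keep the system intact. Very nasty problem :/. Thanks for your help! – Alex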