在具有4個NVIDIA GPU的節點上,我在設備0上啓用了ECC內存保護(所有其他設備都禁用了ECC)。由於我在設備0上啓用了ECC,因此當我嘗試在此設備上創建上下文0(驅動程序API)時,我的應用程序(僅使用一個設備的CUDA)會掛起。我不知道爲什麼它在這一點上掛起。如果我使用不同的設備將CUDA_VISIBLE_DEVICE設置爲相應的其他設備,則工作正常。它必須與啓用ECC相關。有什麼想法嗎? 這裏的nvidia-smi
輸出: (爲什麼會報告99%的揮發性GPU的利用率,沒有什麼是運行在那裏?)無法在啓用了ECC的NVIDIA設備上創建上下文
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54 Driver Version: 304.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m | 0000:02:00.0 Off | 1 |
| N/A 29C P0 49W/225W | 0% 12MB/4799MB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m | 0000:03:00.0 Off | 0 |
| N/A 22C P8 15W/225W | 0% 12MB/4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m | 0000:83:00.0 Off | 0 |
| N/A 22C P8 24W/225W | 0% 11MB/4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m | 0000:84:00.0 Off | 0 |
| N/A 23C P8 25W/225W | 0% 11MB/4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
編輯:nvidia-smi -a
報告在所有設備上啓用ECC。奇怪!
==============NVSMI LOG==============
Timestamp : Fri Apr 26 10:18:14 2013
Driver Version : 304.54
Attached GPUs : 4
GPU 0000:02:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044699
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:02:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 12 MB
Free : 4787 MB
Compute Mode : Default
Utilization
Gpu : 99 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 1
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 1
Aggregate
Single Bit
Device Memory : 1
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 1
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 29 C
Power Readings
Power Management : Supported
Power Draw : 49.51 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:03:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044821
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:03:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 12 MB
Free : 4787 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 19 C
Power Readings
Power Management : Supported
Power Draw : 15.22 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:83:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044783
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x83
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:83:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 11 MB
Free : 4788 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 22 C
Power Readings
Power Management : Supported
Power Draw : 24.74 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:84:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044628
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x84
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:84:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 11 MB
Free : 4788 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 23 C
Power Readings
Power Management : Supported
Power Draw : 25.47 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
打開ECC後重新啓動了嗎?運行'nvidia-smi'可以在其中一個GPU上產生「幻影」利用。 – 2013-04-26 14:09:39
發出'nvidia-smi -i 0 --ecc-config = 1'啓用了ECC,並要求重啓才能生效。是的,我重新啓動節點 – ritter 2013-04-26 14:10:33
您可以在設備0(或所有設備)上運行bandwidthTest cuda示例嗎? – 2013-04-26 14:12:06