2013-04-26 20 views

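Cannot create a context on an NVIDIA device with ECC enabled

On a node with 4 NVIDIA GPUs, I enabled ECC memory protection on device 0 (ECC is disabled on all the other devices). Ever since I enabled ECC on device 0, my application (CUDA, using only one device) hangs when it tries to create a context on device 0 through the driver API. I have no idea why it hangs at this point. If I set CUDA_VISIBLE_DEVICES to one of the other devices, it works fine, so it must be related to enabling ECC. Any ideas?

Also, why does nvidia-smi report 99% volatile GPU utilization on device 0 when nothing is running there? Here is the nvidia-smi output: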

+------------------------------------------------------+
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m               | 0000:02:00.0     Off |                    1 |
| N/A   29C    P0    49W / 225W |   0%   12MB / 4799MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m               | 0000:03:00.0     Off |                    0 |
| N/A   22C    P8    15W / 225W |   0%   12MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m               | 0000:83:00.0     Off |                    0 |
| N/A   22C    P8    24W / 225W |   0%   11MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m               | 0000:84:00.0     Off |                    0 |
| N/A   23C    P8    25W / 225W |   0%   11MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
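
For anyone wanting to reproduce the workaround from the question (pointing the application at a different GPU), here is a minimal Python sketch. It only builds an environment with CUDA_VISIBLE_DEVICES set. The helper name `env_for_device` and the binary name `./my_cuda_app` are made up for illustration and do not come from the original post.

```python
import os


def env_for_device(device_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    device_index is the index as enumerated by the CUDA runtime/driver.
    base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


# Hypothetical usage (binary name is an assumption):
#   import subprocess
#   subprocess.run(["./my_cuda_app"], env=env_for_device(1), check=True)
```

Inside the launched process the selected GPU then shows up as device 0, so the application itself does not need to change.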

Edit: nvidia-smi -a reports ECC as enabled on all devices. Strange!

==============NVSMI LOG============== 

Timestamp      : Fri Apr 26 10:18:14 2013 
Driver Version     : 304.54 

Attached GPUs     : 4 
GPU 0000:02:00.0 
    Product Name    : Tesla K20m 
    Display Mode    : Disabled 
    Persistence Mode   : Enabled 
    Driver Model 
     Current     : N/A 
     Pending     : N/A 
    Serial Number    : 0324512044699 
    VBIOS Version    : 80.10.11.00.0B 
    Inforom Version 
     Image Version   : 2081.0208.01.07 
     OEM Object    : 1.1 
     ECC Object    : 3.0 
     Power Management Object : N/A 
    GPU Operation Mode 
     Current     : Compute 
     Pending     : Compute 
    PCI 
     Bus      : 0x02 
     Device     : 0x00 
     Domain     : 0x0000 
     Device Id    : 0x102810DE 
     Bus Id     : 0000:02:00.0 
     Sub System Id   : 0x101510DE 
     GPU Link Info 
      PCIe Generation 
       Max    : 2 
       Current   : 2 
      Link Width 
       Max    : 16x 
       Current   : 16x 
    Fan Speed     : N/A 
    Performance State   : P0 
    Clocks Throttle Reasons 
     Idle     : Not Active 
     User Defined Clocks  : Not Active 
     SW Power Cap   : Not Active 
     HW Slowdown    : Not Active 
     Unknown     : Not Active 
    Memory Usage 
     Total     : 4799 MB 
     Used     : 12 MB 
     Free     : 4787 MB 
    Compute Mode    : Default 
    Utilization 
     Gpu      : 99 % 
     Memory     : 0 % 
    Ecc Mode 
     Current     : Enabled 
     Pending     : Enabled 
    ECC Errors 
     Volatile 
      Single Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
      Double Bit    
       Device Memory : 1 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 1 
     Aggregate 
      Single Bit    
       Device Memory : 1 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 1 
      Double Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
    Temperature 
     Gpu      : 29 C 
    Power Readings 
     Power Management  : Supported 
     Power Draw    : 49.51 W 
     Power Limit    : 225.00 W 
     Default Power Limit  : 225.00 W 
     Min Power Limit   : 150.00 W 
     Max Power Limit   : 225.00 W 
    Clocks 
     Graphics    : 758 MHz 
     SM      : 758 MHz 
     Memory     : 2600 MHz 
    Applications Clocks 
     Graphics    : 705 MHz 
     Memory     : 2600 MHz 
    Max Clocks 
     Graphics    : 758 MHz 
     SM      : 758 MHz 
     Memory     : 2600 MHz 
    Compute Processes   : None 

GPU 0000:03:00.0 
    Product Name    : Tesla K20m 
    Display Mode    : Disabled 
    Persistence Mode   : Enabled 
    Driver Model 
     Current     : N/A 
     Pending     : N/A 
    Serial Number    : 0324512044821 
    VBIOS Version    : 80.10.11.00.0B 
    Inforom Version 
     Image Version   : 2081.0208.01.07 
     OEM Object    : 1.1 
     ECC Object    : 3.0 
     Power Management Object : N/A 
    GPU Operation Mode 
     Current     : Compute 
     Pending     : Compute 
    PCI 
     Bus      : 0x03 
     Device     : 0x00 
     Domain     : 0x0000 
     Device Id    : 0x102810DE 
     Bus Id     : 0000:03:00.0 
     Sub System Id   : 0x101510DE 
     GPU Link Info 
      PCIe Generation 
       Max    : 2 
       Current   : 1 
      Link Width 
       Max    : 16x 
       Current   : 16x 
    Fan Speed     : N/A 
    Performance State   : P8 
    Clocks Throttle Reasons 
     Idle     : Active 
     User Defined Clocks  : Not Active 
     SW Power Cap   : Not Active 
     HW Slowdown    : Not Active 
     Unknown     : Not Active 
    Memory Usage 
     Total     : 4799 MB 
     Used     : 12 MB 
     Free     : 4787 MB 
    Compute Mode    : Default 
    Utilization 
     Gpu      : 0 % 
     Memory     : 0 % 
    Ecc Mode 
     Current     : Enabled 
     Pending     : Enabled 
    ECC Errors 
     Volatile 
      Single Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
      Double Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
     Aggregate 
      Single Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
      Double Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
    Temperature 
     Gpu      : 19 C 
    Power Readings 
     Power Management  : Supported 
     Power Draw    : 15.22 W 
     Power Limit    : 225.00 W 
     Default Power Limit  : 225.00 W 
     Min Power Limit   : 150.00 W 
     Max Power Limit   : 225.00 W 
    Clocks 
     Graphics    : 324 MHz 
     SM      : 324 MHz 
     Memory     : 324 MHz 
    Applications Clocks 
     Graphics    : 705 MHz 
     Memory     : 2600 MHz 
    Max Clocks 
     Graphics    : 758 MHz 
     SM      : 758 MHz 
     Memory     : 2600 MHz 
    Compute Processes   : None 

GPU 0000:83:00.0 
    Product Name    : Tesla K20m 
    Display Mode    : Disabled 
    Persistence Mode   : Enabled 
    Driver Model 
     Current     : N/A 
     Pending     : N/A 
    Serial Number    : 0324512044783 
    VBIOS Version    : 80.10.11.00.0B 
    Inforom Version 
     Image Version   : 2081.0208.01.07 
     OEM Object    : 1.1 
     ECC Object    : 3.0 
     Power Management Object : N/A 
    GPU Operation Mode 
     Current     : Compute 
     Pending     : Compute 
    PCI 
     Bus      : 0x83 
     Device     : 0x00 
     Domain     : 0x0000 
     Device Id    : 0x102810DE 
     Bus Id     : 0000:83:00.0 
     Sub System Id   : 0x101510DE 
     GPU Link Info 
      PCIe Generation 
       Max    : 2 
       Current   : 1 
      Link Width 
       Max    : 16x 
       Current   : 16x 
    Fan Speed     : N/A 
    Performance State   : P8 
    Clocks Throttle Reasons 
     Idle     : Active 
     User Defined Clocks  : Not Active 
     SW Power Cap   : Not Active 
     HW Slowdown    : Not Active 
     Unknown     : Not Active 
    Memory Usage 
     Total     : 4799 MB 
     Used     : 11 MB 
     Free     : 4788 MB 
    Compute Mode    : Default 
    Utilization 
     Gpu      : 0 % 
     Memory     : 0 % 
    Ecc Mode 
     Current     : Enabled 
     Pending     : Enabled 
    ECC Errors 
     Volatile 
      Single Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
      Double Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
     Aggregate 
      Single Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
      Double Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
    Temperature 
     Gpu      : 22 C 
    Power Readings 
     Power Management  : Supported 
     Power Draw    : 24.74 W 
     Power Limit    : 225.00 W 
     Default Power Limit  : 225.00 W 
     Min Power Limit   : 150.00 W 
     Max Power Limit   : 225.00 W 
    Clocks 
     Graphics    : 324 MHz 
     SM      : 324 MHz 
     Memory     : 324 MHz 
    Applications Clocks 
     Graphics    : 705 MHz 
     Memory     : 2600 MHz 
    Max Clocks 
     Graphics    : 758 MHz 
     SM      : 758 MHz 
     Memory     : 2600 MHz 
    Compute Processes   : None 

GPU 0000:84:00.0 
    Product Name    : Tesla K20m 
    Display Mode    : Disabled 
    Persistence Mode   : Enabled 
    Driver Model 
     Current     : N/A 
     Pending     : N/A 
    Serial Number    : 0324512044628 
    VBIOS Version    : 80.10.11.00.0B 
    Inforom Version 
     Image Version   : 2081.0208.01.07 
     OEM Object    : 1.1 
     ECC Object    : 3.0 
     Power Management Object : N/A 
    GPU Operation Mode 
     Current     : Compute 
     Pending     : Compute 
    PCI 
     Bus      : 0x84 
     Device     : 0x00 
     Domain     : 0x0000 
     Device Id    : 0x102810DE 
     Bus Id     : 0000:84:00.0 
     Sub System Id   : 0x101510DE 
     GPU Link Info 
      PCIe Generation 
       Max    : 2 
       Current   : 1 
      Link Width 
       Max    : 16x 
       Current   : 16x 
    Fan Speed     : N/A 
    Performance State   : P8 
    Clocks Throttle Reasons 
     Idle     : Active 
     User Defined Clocks  : Not Active 
     SW Power Cap   : Not Active 
     HW Slowdown    : Not Active 
     Unknown     : Not Active 
    Memory Usage 
     Total     : 4799 MB 
     Used     : 11 MB 
     Free     : 4788 MB 
    Compute Mode    : Default 
    Utilization 
     Gpu      : 0 % 
     Memory     : 0 % 
    Ecc Mode 
     Current     : Enabled 
     Pending     : Enabled 
    ECC Errors 
     Volatile 
      Single Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
      Double Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
     Aggregate 
      Single Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
      Double Bit    
       Device Memory : 0 
       Register File : 0 
       L1 Cache  : 0 
       L2 Cache  : 0 
       Texture Memory : 0 
       Total   : 0 
    Temperature 
     Gpu      : 23 C 
    Power Readings 
     Power Management  : Supported 
     Power Draw    : 25.47 W 
     Power Limit    : 225.00 W 
     Default Power Limit  : 225.00 W 
     Min Power Limit   : 150.00 W 
     Max Power Limit   : 225.00 W 
    Clocks 
     Graphics    : 324 MHz 
     SM      : 324 MHz 
     Memory     : 324 MHz 
    Applications Clocks 
     Graphics    : 705 MHz 
     Memory     : 2600 MHz 
    Max Clocks 
     Graphics    : 758 MHz 
     SM      : 758 MHz 
     Memory     : 2600 MHz 
    Compute Processes   : None 
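
A log this long is easier to read with a small script that pulls out the ECC mode and the volatile double-bit total for each GPU. The following is only a rough sketch in Python: it assumes the indented "key : value" layout shown above for driver 304.54 (other driver versions may format the report differently), and the function name `summarize_ecc` is made up for illustration.

```python
import re

# Matches block headers such as "GPU 0000:02:00.0" but not "GPU Operation Mode".
GPU_HEADER = re.compile(
    r"^GPU ([0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)


def summarize_ecc(log_text):
    """Return one dict per GPU with its ECC mode and volatile double-bit total."""
    gpus = []
    header = scope = kind = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = GPU_HEADER.match(line)
        if match:
            gpus.append({"bus_id": match.group(1),
                         "ecc_mode": None,
                         "volatile_double_bit": None})
            header = scope = kind = None
            continue
        if ":" not in line:
            # A line without a colon is a section header.
            header = line
            if line in ("Volatile", "Aggregate"):
                scope, kind = line, None
            elif line in ("Single Bit", "Double Bit"):
                kind = line
            else:
                scope = kind = None
            continue
        if not gpus:
            continue  # preamble such as Timestamp or Driver Version
        key, value = (part.strip() for part in line.split(":", 1))
        if header == "Ecc Mode" and key == "Current":
            gpus[-1]["ecc_mode"] = value
        elif key == "Total" and scope == "Volatile" and kind == "Double Bit":
            gpus[-1]["volatile_double_bit"] = int(value)
    return gpus
```

Fed the log above, this would flag GPU 0000:02:00.0 as the only device with a non-zero volatile double-bit total.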

Did you reboot after turning ECC on? Running 'nvidia-smi' can produce "phantom" utilization on one of the GPUs. – 2013-04-26 14:09:39

Issuing 'nvidia-smi -i 0 --ecc-config=1' enabled ECC and asked for a reboot for the change to take effect. Yes, I rebooted the node. – ritter 2013-04-26 14:10:33

Can you run the bandwidthTest CUDA sample on device 0 (or on all devices)? – 2013-04-26 14:12:06

Answer

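The nvidia-smi output shows an uncorrectable ECC error on the device. You can use nvidia-smi --reset-ecc-errors=0 -g 0 to reset the errors and try again. The 0 passed to the reset option means that only the volatile counters are reset; the aggregate counters will still indicate that an error occurred in the past.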

If you see more errors on this device, it is worth investigating the cause further.
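
If you want to script the reset, something along these lines should do. This is a minimal Python sketch that simply wraps the command quoted above: the flags are copied from this answer as written, the helper names are made up, and it assumes nvidia-smi is on PATH and that you have the privileges the reset needs (typically root).

```python
import subprocess


def reset_volatile_ecc_command(gpu_index):
    """Build the nvidia-smi call quoted in the answer for one GPU.

    --reset-ecc-errors=0 resets only the volatile counters.
    """
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    return ["nvidia-smi", "--reset-ecc-errors=0", "-g", str(gpu_index)]


def reset_volatile_ecc(gpu_index):
    # Assumption: nvidia-smi is installed and the caller has sufficient
    # privileges; check=True raises CalledProcessError on a non-zero exit.
    return subprocess.run(reset_volatile_ecc_command(gpu_index), check=True)
```

Keeping the command construction separate from the call makes it easy to print and review the exact command before running it on a production node.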

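Note that in the summary view, the ECC field you are looking at is actually "Volatile Uncorr. ECC", i.e. it is an error count, not an ECC enabled/disabled flag. If ECC were disabled, the field would show "N/A".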
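
To make that reading of the summary column concrete, here is a tiny Python sketch that interprets a single "Volatile Uncorr. ECC" cell according to the rule just described. The function name `describe_ecc_field` is made up for illustration, and the sketch assumes the cell holds either "N/A" or a plain integer.

```python
def describe_ecc_field(field):
    """Interpret one 'Volatile Uncorr. ECC' cell from the summary table."""
    text = field.strip()
    if text == "N/A":
        # No counter is shown when ECC is disabled on the device.
        return "ECC disabled"
    count = int(text)  # otherwise the cell is an error count
    if count == 0:
        return "ECC enabled, no volatile uncorrectable errors"
    return f"ECC enabled, {count} volatile uncorrectable error(s)"
```

With the table from the question, device 0 (cell "1") has ECC enabled with one uncorrectable error, while devices 1 to 3 (cell "0") have ECC enabled and no errors.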

Do I need to reboot the node after resetting the error counts? – ritter 2013-04-26 14:34:11

It works! Thanks for your answer! – ritter 2013-04-26 14:43:09
