額外的寄存器用法if if

我正在研究一個大的cuda內核，我發現內核每個線程使用43個寄存器。爲了瞭解發生了什麼，我編寫了一個較小的程序來計算註冊使用情況。我注意到，無論何時使用if，註冊使用率都會增加。小代碼如下：額外的寄存器用法if if

#include <limits.h> 
#include <stdio.h> 
#include <fstream> 
#include <iostream> 
#include <cstdlib> 
#include <stdint.h> 

using namespace std; 

__global__ void test_ifs(unsigned int* result){ 
    unsigned int k = 0; 
    for(int j=0;j<MAX_COMP;j++){ 
    //if(j <= threadIdx.x%MAX_COMP){                                                   
     k += j; 
     //}                                                          
    } 
    result[threadIdx.x] = k; 
} 

int main(){ 
    unsigned int* result; 
    cudaError_t e1 = cudaMalloc((void**) &result, THREADSPERBLOCK*sizeof(unsigned int)); 
    if(e1 == cudaSuccess){ 
    test_ifs<<<1, THREADSPERBLOCK>>>(result); 
    cudaError_t e2 = cudaGetLastError(); 
    if(e2 == cudaSuccess){ 
    } 
    else{ 
     cout << "kernel failed to launch" << endl; 
    } 
    } 
    else{ 
    cout << "Failed to allocate results memory" << endl; 
    } 
}

當我編譯該代碼，每個線程使用5個寄存器

ptxas info : Compiling entry function '_Z8test_ifsPj' for 'sm_20' 
ptxas info : Function properties for _Z8test_ifsPj 
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
ptxas info : Used 5 registers, 40 bytes cmem[0]

但是，如果我去掉if，每個線程使用8個寄存器。任何人都可以向我解釋發生了什麼事？

ptxas info : Compiling entry function '_Z8test_ifsPj' for 'sm_20' 
ptxas info : Function properties for _Z8test_ifsPj 
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
ptxas info : Used 8 registers, 40 bytes cmem[0]

來源

2012-07-19 gmemon

MAX_COMP的值是多少？看來你的模運算'threadIdx.x％MAX_COMP'沒有被優化，只是在循環中使用一個寄存器。嘗試在循環外移動計算。 – djmj 2012-07-19 22:23:46

MAX_COMP的值是32. – gmemon 2012-07-20 00:16:06

當您將該計算移到循環之外時會發生什麼？ – djmj 2012-07-20 04:43:02

您在本例中看到的行爲是由於編譯器優化。在簡單循環的情況下，循環的結果可以在編譯時計算，整個循環代碼由一個常量代替，而在包含if語句的循環中，循環的結果取決於其值不是編譯器已知，並且循環必須停留。

爲了證明這一點的話，讓我們來看看你的內核的只是略有修改的版本：

#define MAX_COMP (32) 

template<unsigned int s> 
__global__ void test_ifs(unsigned int * result){ 
    unsigned int k = 0; 
    for(int j=0;j<MAX_COMP;j++){ 
     switch (s) { 
      case 1: 
       if (j <= threadIdx.x%MAX_COMP){ k += j; } 
       break;    

      case 0: 
       { k += j; } 
     } 
    } 
    result[threadIdx.x] = k; 
} 

template __global__ void test_ifs<0>(unsigned int *); 
template __global__ void test_ifs<1>(unsigned int *);

它發射的PTX。對於第一種情況：

.entry _Z8test_ifsILj0EEvPj (
     .param .u32 __cudaparm__Z8test_ifsILj0EEvPj_result) 
    { 
    .reg .u16 %rh<3>; 
    .reg .u32 %r<6>; 
    .loc 14 4 0 
$LDWbegin__Z8test_ifsILj0EEvPj: 
    .loc 14 16 0 
    mov.u32  %r1, 496; <--- here the loop has been replaced with 496 
    ld.param.u32 %r2, [__cudaparm__Z8test_ifsILj0EEvPj_result]; 
    mov.u16  %rh1, %tid.x; 
    mul.wide.u16 %r3, %rh1, 4; 
    add.u32  %r4, %r2, %r3; 
    st.global.u32 [%r4+0], %r1; 
    .loc 14 17 0 
    exit; 
$LDWend__Z8test_ifsILj0EEvPj: 
    } // _Z8test_ifsILj0EEvPj

和第二種情況中，環仍保持不變：

.entry _Z8test_ifsILj1EEvPj (
     .param .u32 __cudaparm__Z8test_ifsILj1EEvPj_result) 
    { 
    .reg .u32 %r<11>; 
    .reg .pred %p<4>; 
    .loc 14 4 0 
$LDWbegin__Z8test_ifsILj1EEvPj: 
    cvt.u32.u16  %r1, %tid.x; 
    and.b32  %r2, %r1, 31; 
    mov.s32  %r3, 0; 
    mov.u32  %r4, 0; 
$Lt_1_3842: 
//<loop> Loop body line 4, nesting depth: 1, iterations: 32 
    .loc 14 7 0 
    add.u32  %r5, %r3, %r4; 
    setp.le.u32  %p1, %r3, %r2; 
    selp.u32 %r4, %r5, %r4, %p1; 
    add.s32  %r3, %r3, 1; 
    mov.u32  %r6, 32; 
    setp.ne.s32  %p2, %r3, %r6; 
    @%p2 bra $Lt_1_3842; 
    .loc 14 16 0 
    ld.param.u32 %r7, [__cudaparm__Z8test_ifsILj1EEvPj_result]; 
    mul24.lo.u32 %r8, %r1, 4; 
    add.u32  %r9, %r7, %r8; 
    st.global.u32 [%r9+0], %r4; 
    .loc 14 17 0 
    exit; 
$LDWend__Z8test_ifsILj1EEvPj: 
    } // _Z8test_ifsILj1EEvPj

你不應該得出這樣的結論差異將總是是由於編譯器的優化，因爲這在很大程度上取決於代碼和編譯器。但在這種情況下，這是不同的。

來源

2012-07-20 05:56:21 talonmies

額外的寄存器用法if if

回答

相關問題