I posted this on the NVIDIA forums, but I figured I would get more eyes on it here: CUDA threads and blocks.

I am having issues with attempting to scale my code to perform with multiple cases. I have been developing with the most common case in mind; now it is testing time and I need to make sure it works for the different cases. Currently my kernel is executed within a loop (there are reasons why we are not doing one kernel call to do the whole thing) to calculate the values of a matrix row. The most common case is 512 columns by 512 rows. I need to consider matrices of size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a fairly simple kernel call:

launchKernel<<<1,512>>>(................) 

This kernel works fine for the common 512 x 512 and 512 x 1024 (columns, rows respectively) cases, but not for the 1024 x 512 case. That case requires 1024 threads to execute. In my naivety, I have been trying different versions of the simple kernel call to launch 1024 threads:

launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ??? 
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ??? 
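
(For reference, both of those launches do produce 1024 global threads. A minimal sketch, with a throwaway kernel indexDemo that is not my real one, of how the two configurations map to the same index range:)

__global__ void indexDemo(float *out) 
{ 
    // <<<2,512>>>:  blockIdx.x is 0 or 1, blockDim.x = 512  -> idx in [0, 1023] 
    // <<<1,1024>>>: blockIdx.x is 0,      blockDim.x = 1024 -> idx in [0, 1023] 
    int idx = (blockIdx.x * blockDim.x) + threadIdx.x; 
    out[idx] = (float)idx; 
} 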

I believe my problem has something to do with my lack of understanding of threads and blocks. Here is the output of deviceQuery; as you can see, I can have a maximum of 1024 threads per block:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting... 

CUDA Device Query (Runtime API) version (CUDART static linking) 

Found 2 CUDA Capable device(s) 

Device 0: "Tesla C2050" 
    CUDA Driver Version/Runtime Version   4.2/4.1 
    CUDA Capability Major/Minor version number: 2.0 
    Total amount of global memory:     2688 MBytes (2818572288 bytes) 
    (14) Multiprocessors x (32) CUDA Cores/MP:  448 CUDA Cores 
    GPU Clock Speed:        1.15 GHz 
    Memory Clock rate:        1500.00 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         786432 bytes 
    Max Texture Dimension Size (x,y,z)    1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048) 
    Max Layered Texture Size (dim) x layers  1D=(16384) x 2048, 2D=(16384,16384) x 2048 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 32768 
    Warp size:          32 
    Maximum number of threads per block:   1024 
    Maximum sizes of each dimension of a block: 1024 x 1024 x 64 
    Maximum sizes of each dimension of a grid:  65535 x 65535 x 65535 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and execution:     Yes with 2 copy engine(s) 
    Run time limit on kernels:      Yes 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Concurrent kernel execution:     Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support enabled:    Yes 
    Device is using TCC driver mode:    No 
    Device supports Unified Addressing (UVA):  No 
    Device PCI Bus ID/PCI location ID:   40/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

Device 1: "Quadro 600" 
    CUDA Driver Version/Runtime Version   4.2/4.1 
    CUDA Capability Major/Minor version number: 2.1 
    Total amount of global memory:     1024 MBytes (1073741824 bytes) 
    (2) Multiprocessors x (48) CUDA Cores/MP:  96 CUDA Cores 
    GPU Clock Speed:        1.28 GHz 
    Memory Clock rate:        800.00 Mhz 
    Memory Bus Width:        128-bit 
    L2 Cache Size:         131072 bytes 
    Max Texture Dimension Size (x,y,z)    1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048) 
    Max Layered Texture Size (dim) x layers  1D=(16384) x 2048, 2D=(16384,16384) x 2048 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 32768 
    Warp size:          32 
    Maximum number of threads per block:   1024 
    Maximum sizes of each dimension of a block: 1024 x 1024 x 64 
    Maximum sizes of each dimension of a grid:  65535 x 65535 x 65535 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and execution:     Yes with 1 copy engine(s) 
    Run time limit on kernels:      Yes 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Concurrent kernel execution:     Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support enabled:    No 
    Device is using TCC driver mode:    No 
    Device supports Unified Addressing (UVA):  No 
    Device PCI Bus ID/PCI location ID:   15/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600 

I am only using the Tesla C2050 device. Here is a stripped-down version of my kernel, so you have an idea of what it is doing.

#define twoPi    6.283185307179586 
#define speed_of_light  3.0E8 
#define MaxSize    999 

__global__ void calcRx4CPP4 
( 
     const float *array1, 
     const double *array2, 
     const float scalar1, 
     const float scalar2, 
     const float scalar3, 
     const float scalar4, 
     const float scalar5, 
     const float scalar6, 
     const int scalar7, 
     const int scalar8,  
     float *outputArray1, 
     float *outputArray2) 
{ 

    float scalar9; 
    int idx; 
    double scalar10; 
    double scalar11; 
    float sumReal, sumImag; 
    float real, imag; 

    float coeff1, coeff2, coeff3, coeff4; 

    sumReal = 0.0; 
    sumImag = 0.0; 

    // kk loop 1 .. 512 (scalar7) 
    idx = (blockIdx.x * blockDim.x) + threadIdx.x; 

    /* Declare the shared memory parameters */ 
    __shared__ float SharedArray1[MaxSize]; 
    __shared__ double SharedArray2[MaxSize]; 

    /* populate the arrays on shared memory */ 
    SharedArray1[idx] = array1[idx]; // first 512 elements 
    SharedArray2[idx] = array2[idx]; 
    if (idx+blockDim.x < MaxSize){ 
     SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x]; 
     SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x]; 
    }    
    __syncthreads(); 

    // input scalars used here. 
    scalar10 = ...; 
    scalar11 = ...; 

    for (int kk = 0; kk < scalar8; kk++) 
    { 
     /* some calculations */ 
     // SharedArray1, SharedArray2 and scalar9 used here 
     sumReal = ...; 
     sumImag = ...; 
    } 


    /* calculation of the exponential of a complex number */ 
    real = ...; 
    imag = ...; 
    coeff1 = (sumReal * real); 
    coeff2 = (sumReal * imag); 
    coeff3 = (sumImag * real); 
    coeff4 = (sumImag * imag); 

    outputArray1[idx] = (coeff1 - coeff4); 
    outputArray2[idx] = (coeff2 + coeff3); 


} 
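
For context, the host side calls this kernel in a loop over the matrix rows, roughly like the sketch below (simplified; the loop variable names and the omitted arguments are placeholders):

// one kernel launch per matrix row, one thread per matrix column 
for (int row = 0; row < numRows; row++) 
{ 
    calcRx4CPP4<<<1, numCols>>>(/* ... row-specific arguments ... */); 
} 
cudaDeviceSynchronize(); 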

Since my maximum number of threads per block is 1024, I thought I would be able to continue to use the simple kernel launch. Am I wrong?

How do I successfully launch each kernel with 1024 threads?

+1

What is the actual problem? What is not working? If you have 1 block and 1024 threads, you need a shared array of size 1024, not one of size MaxSize = 999. – djmj

+0

MaxSize = 999 has nothing to do with the threads; it is the exact size of the arrays being copied into shared memory. Every thread has to iterate over the whole array to get a sum (a simplification of the actual algorithm). The problem is that when I try to use 1024 threads for the 512-column by 1024-row case, it does not work. The resulting outputArray1/2 is not completely populated, so not all 1024 threads are executing. –

+0

I do not know what your kernel is doing, but if you use 1024 threads and use "SharedArray1[idx] = array1[idx];" you will by definition be writing out of bounds, since the array size is 999 while idx can be in the range [0-1023]. – djmj

Answers

5

You do not want to vary the number of threads per block. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator. After you have that number, you simply launch however many blocks are required to get the total number of threads you need. If the number of threads you need for a given case is not always a multiple of the threads per block, you can add code at the top of the kernel to abort the unneeded threads (if () return;). Then you pass in the dimensions of the matrix either with extra parameters to the kernel or by using the grid's x and y dimensions, depending on which information your kernel needs (I have not studied it).
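
For example, a minimal sketch of that launch pattern (launchKernel, totalThreads, and threadsPerBlock are placeholders, and 512 is just an assumed result from the calculator):

// host side: derive the block count from the total number of threads needed 
const int totalThreads    = 1024; // e.g. the 1024-column case 
const int threadsPerBlock = 512;  // assumed optimum from the Occupancy Calculator 
const int blocks = (totalThreads + threadsPerBlock - 1) / threadsPerBlock; 
launchKernel<<<blocks, threadsPerBlock>>>(/* ..., */ totalThreads); 

// device side: abort the threads past the end of the data 
__global__ void launchKernel(/* ..., */ int totalThreads) 
{ 
    int idx = (blockIdx.x * blockDim.x) + threadIdx.x; 
    if (idx >= totalThreads) return; 
    // ... rest of the kernel ... 
} 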

My guess is that the reason you are running into trouble with 1024 threads is that, even though your GPU supports that many threads in a block, there is another limiting factor on the number of threads you can have per block, based on the resource usage in your kernel. The limiting factor can be shared memory or register usage. The Occupancy Calculator will tell you which, although that information only matters if you want to optimize your kernel.
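
If you would rather check that limit at runtime than in the spreadsheet, cudaFuncGetAttributes reports a per-kernel maximum that already accounts for the kernel's register usage. A small sketch (error handling trimmed; place it in your host code):

#include <cstdio> 

cudaFuncAttributes attr; 
if (cudaFuncGetAttributes(&attr, calcRx4CPP4) == cudaSuccess) 
{ 
    // maxThreadsPerBlock is the per-kernel limit; it can be lower than 
    // the device-wide 1024 that deviceQuery reports 
    printf("regs/thread: %d, static smem/block: %d, max threads/block: %d\n", 
           attr.numRegs, (int)attr.sharedSizeBytes, attr.maxThreadsPerBlock); 
} 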

+1

The CUDA Occupancy Calculator is located at: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls –

+0

"Even though your GPU supports that many threads in a block, there is another limiting factor on the number of threads you can have per block, based on the resource usage in your kernel." Shouldn't launchKernel<<<2,512>>>(..) work then, since that is 512 threads per block with 2 blocks? –

+0

"This kernel works fine for the common 512 x 512 and 512 x 1024 (columns, rows respectively) cases, but not for the 1024 x 512 case, which requires 1024 threads to execute." So for an (x, y) sized matrix, do you invoke the kernel y times with x threads each time? –

3

If you use one block with 1024 threads you will have problems, since MaxSize is only 999, which results in wrong data.

Let's simulate it for the last thread, #1023:

__shared__ float SharedArray1[999];  
__shared__ double SharedArray2[999]; 

/* populate the arrays on shared memory */  
SharedArray1[1023] = array1[1023]; // out of bounds: the shared array has only 999 elements 
SharedArray2[1023] = array2[1023]; // out of bounds as well 

if (2047 < MaxSize) // 2047 < 999 is false, so this branch is skipped 
{   
    SharedArray1[2047] = array1[2047];   
    SharedArray2[2047] = array2[2047];  
}     
__syncthreads(); 

This cannot work if you then use all of those elements in your calculation. (Your calculation code is not shown, so that is an assumption.)
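
A minimal sketch of a bounds-checked version of that load, keeping the original names (whether the rest of the kernel can tolerate threads 999-1023 loading nothing is an assumption):

/* populate the arrays on shared memory, guarding every write */ 
if (idx < MaxSize) 
{ 
    SharedArray1[idx] = array1[idx]; 
    SharedArray2[idx] = array2[idx]; 
} 
if (idx + blockDim.x < MaxSize) 
{ 
    SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x]; 
    SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x]; 
} 
__syncthreads(); 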