CUDA threads and blocks
I posted this on the NVIDIA forums as well; I figured I would get more eyes on it here.
I am having trouble extending my code to handle multiple cases. I have been developing around the most common case, and now it is testing time and I need to make sure it works for the different cases. Currently my kernel is executed within a loop (there are reasons why we are not doing one kernel call to do the whole thing) to calculate the values of a single matrix row. The most common case is 512 columns by 512 rows. I need to consider matrices of size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a fairly simple kernel call:
launchKernel<<<1,512>>>(................)
This kernel works fine for the common 512 x 512 and 512 x 1024 (columns, rows respectively) cases, but not for the 1024 x 512 case, which requires 1024 threads to execute. In my naivety, I have been trying different versions of the simple kernel call to launch those 1024 threads:
launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???
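As far as I can tell from the programming guide, both of those launch shapes should give 1024 threads in total on this hardware; the difference is how each thread derives its global index, plus the fact that shared memory is per block, so with <<<2,512>>> each block sees only its own copy. Here is a minimal standalone sketch I put together (the indexDemo kernel is made up for illustration, it is not my real code):

#include <cuda_runtime.h>

// Hypothetical indexDemo kernel: each thread computes its global index,
// which runs 0..1023 for both of the launch shapes below.
__global__ void indexDemo(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)              // guard so no thread ever writes out of range
        out[idx] = (float)idx;
}

int main()
{
    const int n = 1024;
    float *d_out;
    cudaMalloc((void**)&d_out, n * sizeof(float));

    indexDemo<<<2, 512>>>(d_out, n);   // 2 blocks x 512 threads = 1024 threads
    indexDemo<<<1, 1024>>>(d_out, n);  // 1 block x 1024 threads = 1024 threads
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}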
I believe my problem has something to do with my lack of understanding of threads and blocks. Here is the output of deviceQuery; you can see that I can have a maximum of 1024 threads per block:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Tesla C2050"
CUDA Driver Version/Runtime Version 4.2/4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818572288 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID/PCI location ID: 40/0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro 600"
CUDA Driver Version/Runtime Version 4.2/4.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
(2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.28 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID/PCI location ID: 15/0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600
I am only using the Tesla C2050 device. Here is a stripped-down version of my kernel, so you have an idea of what it is doing:
#define twoPi 6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize 999
__global__ void calcRx4CPP4
(
    const float *array1,
    const double *array2,
    const float scalar1,
    const float scalar2,
    const float scalar3,
    const float scalar4,
    const float scalar5,
    const float scalar6,
    const int scalar7,
    const int scalar8,
    float *outputArray1,
    float *outputArray2)
{
    float scalar9;
    int idx;
    double scalar10;
    double scalar11;
    float sumReal, sumImag;
    float real, imag;
    float coeff1, coeff2, coeff3, coeff4;

    sumReal = 0.0;
    sumImag = 0.0;

    // kk loop 1 .. 512 (scalar7)
    idx = (blockIdx.x * blockDim.x) + threadIdx.x;

    /* Declare the shared memory parameters */
    __shared__ float SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    /* populate the arrays on shared memory */
    SharedArray1[idx] = array1[idx]; // first 512 elements
    SharedArray2[idx] = array2[idx];
    if (idx + blockDim.x < MaxSize) {
        SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x];
        SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x];
    }
    __syncthreads();

    // input scalars used here.
    scalar10 = ...;
    scalar11 = ...;

    for (int kk = 0; kk < scalar8; kk++)
    {
        /* some calculations */
        // SharedArray1, SharedArray2 and scalar9 used here
        sumReal = ...;
        sumImag = ...;
    }

    /* calculation of the exponential of a complex number */
    real = ...;
    imag = ...;

    coeff1 = (sumReal * real);
    coeff2 = (sumReal * imag);
    coeff3 = (sumImag * real);
    coeff4 = (sumImag * imag);

    outputArray1[idx] = (coeff1 - coeff4);
    outputArray2[idx] = (coeff2 + coeff3);
}
Since my maximum number of threads per block is 1024, I thought I could keep using the simple kernel launch. Am I wrong?
How do I successfully launch each kernel with 1024 threads?
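For context, this is roughly how I picture deriving the launch shape from the column count instead of hard-coding <<<1,512>>>; a minimal sketch with placeholder names (launchKernelDemo, numColumns, and numRows are not my real identifiers, and my real argument list is omitted as above):

#include <cuda_runtime.h>

// Hypothetical stand-in for my real kernel; the per-column work is a placeholder.
__global__ void launchKernelDemo(float *rowOut, int numColumns)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numColumns)         // guard for widths that are not a multiple of 512
        rowOut[idx] = (float)idx; // placeholder for the real per-column work
}

int main()
{
    const int numColumns = 1024;  // e.g. the 1024 x 512 case
    const int numRows = 512;

    float *d_row;
    cudaMalloc((void**)&d_row, numColumns * sizeof(float));

    // Derive the launch shape from the matrix width instead of hard-coding it.
    int threadsPerBlock = 512;
    int numBlocks = (numColumns + threadsPerBlock - 1) / threadsPerBlock;

    // One kernel call per matrix row, as in my current loop.
    for (int row = 0; row < numRows; ++row)
        launchKernelDemo<<<numBlocks, threadsPerBlock>>>(d_row, numColumns);

    cudaDeviceSynchronize();
    cudaFree(d_row);
    return 0;
}

With 1024 columns this gives <<<2,512>>>, which is the first shape I tried above.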
What is the actual problem? What is not working? If you have 1 block and 1024 threads, you need a shared array of size 1024, not one indexed against MaxSize = 999. – djmj
MaxSize = 999 has nothing to do with the number of threads; it is the exact size of the array that gets copied into shared memory. Each thread has to iterate over the entire array to compute the sums (simplified from the actual algorithm). The problem is that when I try to use 1024 threads for the case with 512 columns and 1024 rows, it does not work. The resulting outputArray1/2 is not completely populated, so not all 1024 threads are being executed. –
I do not know what your kernel is doing, but if you use 1024 threads with `SharedArray1[idx] = array1[idx];`, you will be writing out of bounds, since the array size is 999 while idx can be anywhere in the range [0, 1023]. – djmj
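A minimal sketch of the bounds check being described here (guardedLoadDemo is a hypothetical kernel, not the asker's code; MaxSize stays at 999, the excess threads are masked off, and __syncthreads() sits outside the guard so every thread reaches it):

#define MaxSize 999

// Hypothetical guardedLoadDemo kernel: same load pattern as calcRx4CPP4,
// but every shared/global access is masked so threads 999..1023 stay idle.
__global__ void guardedLoadDemo(const float *array1, const double *array2,
                                float *outputArray1)
{
    __shared__ float SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;

    // With <<<1,1024>>>, idx runs 0..1023; without this guard the writes
    // at idx >= MaxSize land past the end of the 999-element arrays.
    if (idx < MaxSize) {
        SharedArray1[idx] = array1[idx];
        SharedArray2[idx] = array2[idx];
    }
    __syncthreads(); // must be reached by all threads, so it stays outside the if

    if (idx < MaxSize)
        outputArray1[idx] = SharedArray1[idx] + (float)SharedArray2[idx]; // placeholder math
}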