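Here's the code I use to split a given task into blocks and a grid. Yes, you may end up launching slightly more blocks than strictly necessary (only a few), and you will probably get more actual threads than required, but this approach is easy and efficient. See the second code example below for my simple in-kernel bounds check. PS: I always use block_size == 128, because it has been a good trade-off between multiprocessor occupancy, register usage, shared memory requirements and coalesced memory access for all of my kernels.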
Code to calculate a good grid size (host):
#include <assert.h>

//maximum size of one grid dimension (65535 is the limit for gridDim.y)
#define GRID_SIZE 65535

//ceil division (rounding up); defined first so it is declared before use
unsigned int kernelUtilCeilDiv(unsigned int numerator, unsigned int denominator) {
    return (numerator + denominator - 1)/denominator;
}

//calculate grid size (store result in grid/block)
void kernelUtilCalcGridSize(unsigned int num_threads, unsigned int block_size, dim3* grid, dim3* block) {
    //num_threads == 0 would give gy == 0 and a division by zero below
    assert(num_threads > 0 && block_size > 0);
    //block
    block->x = block_size;
    block->y = 1;
    block->z = 1;
    //number of blocks
    unsigned int num_blocks = kernelUtilCeilDiv(num_threads, block_size);
    unsigned int total_threads = num_blocks * block_size;
    assert(total_threads >= num_threads);
    //calculate grid size
    unsigned int gy = kernelUtilCeilDiv(num_blocks, GRID_SIZE);
    unsigned int gx = kernelUtilCeilDiv(num_blocks, gy);
    unsigned int total_blocks = gx * gy;
    assert(total_blocks >= num_blocks);
    //grid
    grid->x = gx;
    grid->y = gy;
    grid->z = 1;
}
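To make the arithmetic concrete, here is a minimal host-only sketch that mirrors the grid calculation above so it can be tried without nvcc. The names `Dim3`, `kMaxGridDim`, `ceilDiv` and `calcGrid` are hypothetical stand-ins introduced for this illustration (they are not part of CUDA or of the code above), and the sketch assumes `num_threads > 0`.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, so this compiles as plain C++.
struct Dim3 { unsigned int x, y, z; };

// Same value as GRID_SIZE in the host code above.
const unsigned int kMaxGridDim = 65535;

// Ceiling division, same formula as kernelUtilCeilDiv.
unsigned int ceilDiv(unsigned int numerator, unsigned int denominator) {
    return (numerator + denominator - 1) / denominator;
}

// Mirrors the arithmetic of kernelUtilCalcGridSize, but returns the grid
// by value instead of writing through pointers.
Dim3 calcGrid(unsigned int num_threads, unsigned int block_size) {
    // Assumption: a zero-sized launch is rejected rather than handled.
    assert(num_threads > 0 && block_size > 0);
    unsigned int num_blocks = ceilDiv(num_threads, block_size);
    // Number of grid rows needed so that no row exceeds the limit.
    unsigned int gy = ceilDiv(num_blocks, kMaxGridDim);
    // Spread the blocks evenly over those rows.
    unsigned int gx = ceilDiv(num_blocks, gy);
    return Dim3{gx, gy, 1};
}
```

For example, `calcGrid(10000000, 128)` needs 78,125 blocks, which gives `gy = 2` and `gx = 39063`. That launches 78,126 blocks, so exactly one block is surplus. In general the number of surplus blocks is smaller than `gy`, which is why only a few extra blocks are ever launched.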
Code to calculate the unique thread id and check the bounds (device):
//some kernel
__global__ void kernelFoo(unsigned int num_threads, ...) {
    //calculate unique id
    const unsigned int thread_id = threadIdx.x;
    const unsigned int block_id = blockIdx.x + blockIdx.y * gridDim.x;
    const unsigned int unique_id = thread_id + block_id * blockDim.x;
    //check range
    if (unique_id >= num_threads) return;
    //do the actual work
    ...
}
I don't think that's a lot of effort, registers or lines of code to check for correctness.
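As a sanity check of the index arithmetic in the kernel above, the following host-only sketch replays it on the CPU and counts how often each work item would be processed. `simulateLaunch` is a hypothetical helper written for this illustration; its loop variables stand in for `blockIdx`, `threadIdx`, `gridDim` and `blockDim`, and no GPU is involved.

```cpp
#include <cassert>
#include <vector>

// Replays the kernel's id computation for every (block, thread) pair of a
// launch and returns, per work item, how many threads would process it.
std::vector<unsigned int> simulateLaunch(unsigned int num_threads,
                                         unsigned int block_dim_x,
                                         unsigned int grid_dim_x,
                                         unsigned int grid_dim_y) {
    // The launch must provide at least num_threads threads in total.
    assert((unsigned long long)grid_dim_x * grid_dim_y * block_dim_x >= num_threads);
    std::vector<unsigned int> hits(num_threads, 0);
    for (unsigned int by = 0; by < grid_dim_y; ++by) {
        for (unsigned int bx = 0; bx < grid_dim_x; ++bx) {
            for (unsigned int tx = 0; tx < block_dim_x; ++tx) {
                // Same formulas as in the kernel.
                const unsigned int block_id = bx + by * grid_dim_x;
                const unsigned int unique_id = tx + block_id * block_dim_x;
                // Equivalent of the kernel's early return for surplus threads.
                if (unique_id >= num_threads) continue;
                ++hits[unique_id];
            }
        }
    }
    return hits;
}
```

With a 2 x 2 grid of 4-thread blocks and 10 work items, 16 threads are simulated: ids 0 to 9 are each processed once and the 6 surplus threads are filtered out by the range check.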
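sigh.. then I need a lot of checking code.. thanks – 2011-03-29 22:22:23
@Paul: It's just one line of device code to check the bounds. – Stringer 2011-03-29 23:02:14
That's not my case: I'm working on the coalesced separable convolution filter from the NVIDIA SDK. By the way, I'll see what I can do – 2011-03-30 10:47:24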