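CUDA - Kernel uses more registers than expected?
I have a kernel that computes a sum. Counting the variables declared in the kernel, I would expect it to use 5 registers in total*. However, when I profile the kernel, it reports 34 registers. I need to get down to 30 registers to be able to run 1024 threads.
Can anyone see what is wrong?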
__global__ void sum_kernel(float* values, float bk_size, int start_idx, int end_idx, int resolution, float* avgs){
    // Allocate shared memory (assuming a maximum of 1024 threads).
    __shared__ float sums[1024];

    // Boundary check.
    if(blockIdx.x == 0){
        avgs[blockIdx.x] = values[start_idx];
        return;
    }
    else if(blockIdx.x == resolution-1) {
        avgs[blockIdx.x] = values[start_idx+(end_idx-start_idx)-1];
        return;
    }
    else if(blockIdx.x > resolution -2){
        return;
    }

    // Iteration index calculation.
    unsigned int idx_prev = floor((blockIdx.x + 0) * bk_size) + 1;
    unsigned int from = idx_prev + threadIdx.x*(bk_size/blockDim.x);
    unsigned int to = from + (bk_size/blockDim.x);
    to = (to < (end_idx-start_idx))? to : (end_idx-start_idx);

    // Partial sum calculation using shared memory.
    sums[threadIdx.x] = 0;
    for (; from < to; from++)
    {
        sums[threadIdx.x] += values[from+start_idx];
    }
    __syncthreads();

    // Addition of partial sums (thread 0 only).
    if(threadIdx.x != 0) return;
    // 'sum' is not declared in the post; starting from sums[0] is assumed
    // here because the loop below begins at index 1.
    float sum = sums[0];
    from = 1;
    for(; from < 1024; from++)
    {
        sum += sums[from];
    }
    avgs[blockIdx.x] = sum;
}
* Assuming 2 registers per pointer, 1 register per unsigned int, and parameters stored in constant memory.
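For anyone who wants to check the numbers on their own setup, here is a minimal sketch of how the register count could be queried at runtime and how the compiler could be asked to stay within a 1024-thread launch. The kernel `bounded_sum_kernel`, its parameter `n`, and the helper `register_budget` are hypothetical stand-ins, not the kernel from the question. `__launch_bounds__`, `cudaFuncGetAttributes`, and `cudaGetDeviceProperties` are standard CUDA facilities. The budget calculation ignores register allocation granularity, so treat it as a rough upper bound only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel. __launch_bounds__(1024) tells the compiler
// that the kernel may be launched with up to 1024 threads per block, so it
// has to limit register use accordingly (spilling to local memory if needed).
__global__ void __launch_bounds__(1024)
bounded_sum_kernel(const float* values, float* avgs, int n)
{
    __shared__ float sums[1024];

    // Each thread accumulates a strided partial sum in a register.
    float partial = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        partial += values[i];
    sums[threadIdx.x] = partial;
    __syncthreads();

    // Thread 0 adds up the partial sums of the threads that actually ran.
    if (threadIdx.x != 0) return;
    float sum = 0.0f;
    for (unsigned int i = 0; i < blockDim.x; ++i)
        sum += sums[i];
    avgs[blockIdx.x] = sum;
}

// Rough per-thread register budget for a block of `threads` threads,
// ignoring the hardware's register allocation granularity.
int register_budget(int regs_per_block, int threads)
{
    return regs_per_block / threads;
}

int main()
{
    // Ask the runtime how many registers the compiled kernel uses per thread.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, bounded_sum_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Ask the device how many 32-bit registers one block may use.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread used by kernel: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);
    printf("rough register budget for 1024 threads: %d\n",
           register_budget(prop.regsPerBlock, 1024));
    return 0;
}
```

The same register count is printed at compile time when building with `nvcc -Xptxas -v`, and `--maxrregcount=N` is an alternative way to cap registers for a whole compilation unit. Either approach may cause spills to local memory, so it is worth checking that the kernel does not get slower.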
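A minor correction: predicates are not stored in R registers but in predicate registers (P0 in this case). – njuffa
@njuffa Thank you very much. I have just fixed the last sentence of the answer. – JackOLantern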