0
我想在此代碼導入CUDA:CUDA優化:嵌套循環
double square=0;
for(int j=0;j<width; j++) {
double Up=0,Down=0;
for(int i=0;i<height; i++) {
if(array1[i]>0 && array2[i]>0){
square = source[i*width+j];
square = square*square;
Up += square*array2[i]/array1[i];
Down += square;
}
}
if(Down>0){
out[j] *= (1.+(Up/Down-1.));
}
}
在第一次嘗試我減少了第一個for循環。 (很好)
int j = blockDim.x * blockIdx.x + threadIdx.x;
double Up=0, Down=0, square=0;
if (j<width) {
for(int i=0;i<height;i++) {
if(array1[i]>0 && array2[i]>0){
square = source[i*width+j];
square = square*square;
Up += square*array2[i]/array1[i];
Down += square;
}
}
if(Down>0){
out[j] *= (1.+(Up/Down-1.));
}
}
我也會減少第二個循環,我試圖用2D網格不起作用。 這是內核:
int j = blockDim.x * blockIdx.x + threadIdx.x;
int i = blockDim.y * blockIdx.y + threadIdx.y;
int offset = j + i * blockDim.x * gridDim.x;
double Up[width],Down[width], square[height];
if (j>=width && i>=height) return;
if(array1[i]>0 && array2[i]>0){
square[i] = source[offset]*source[offset];
Up[j] += square[i]*array2[i]/array1[i];
Down[j] += square[i];
}
if(Down[j]>0){
out[j] *= (1.+(Up[j]/Down[j]-1.));
}
,這是內核調用:
dim3 blocks(32,32);
dim3 grid(width/32,height/32);
kernel <<< grid, blocks >>> (...);
cudaDeviceSynchronize();
...什麼是錯誤?有更有效的解決方案? (我可以使用動態並行?)
非常感謝!