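OpenACC red-black Gauss-Seidel slower than CPU
I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated-plate problem), but the GPU-accelerated code is no faster than the CPU version, even for large problems.
I also wrote a CUDA version, which is much faster than both (for 512x512, about 2 seconds, versus about 25 seconds for both the CPU and OpenACC versions).
Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should still do better than the CPU (as it does for the Jacobi solver applied to the same kind of problem, shown here).
Here is the relevant code (the full working source is here):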
#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {
    Real norm_L2 = 0.0;
    // update red cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
            reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM/2) + 1; ++row) {
            int ind_red = col * ((NUM/2) + 2) + row; // local (red) index
            int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1); // global index
            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])
            Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM/2) + 2)]
                               + aE[ind] * temp_black[row + (col + 1) * ((NUM/2) + 2)]
                               + aS[ind] * temp_black[row - (col % 2) + col * ((NUM/2) + 2)]
                               + aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM/2) + 2)]);
            Real temp_old = temp_red[ind_red];
            temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res/aP[ind]);
            // accumulate squared change of this cell
            res = temp_red[ind_red] - temp_old;
            norm_L2 += (res * res);
        } // end for row
    } // end for col
    // update black cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
            reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM/2) + 1; ++row) {
            int ind_black = col * ((NUM/2) + 2) + row; // local (black) index
            int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1); // global index
            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])
            Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM/2) + 2)]
                               + aE[ind] * temp_red[row + (col + 1) * ((NUM/2) + 2)]
                               + aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM/2) + 2)]
                               + aN[ind] * temp_red[row + (col % 2) + col * ((NUM/2) + 2)]);
            Real temp_old = temp_black[ind_black];
            temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res/aP[ind]);
            // accumulate squared change of this cell
            res = temp_black[ind_black] - temp_old;
            norm_L2 += (res * res);
        } // end for row
    } // end for col
    // calculate residual norm over all cells
    norm_L2 = sqrt(norm_L2/((Real)size));
    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);
    // if tolerance has been reached, end SOR iterations
    if (norm_L2 < tol) {
        break;
    }
}
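The two `ind` formulas are the least obvious part of the loops above, so here is a small standalone check of them. It is only a sketch: `red_index`, `black_index`, and `check_red_black_coverage` are hypothetical helper names that are not in my source, and it assumes `NUM` is even. It counts how often each global cell is visited by the red and black sweeps together.

```c
#include <stdlib.h>

/* Global index of the red cell at (row, col), copied from the red loop. */
static int red_index(int row, int col, int num) {
    return 2 * row - (col % 2) - 1 + num * (col - 1);
}

/* Global index of the black cell at (row, col), copied from the black loop. */
static int black_index(int row, int col, int num) {
    return 2 * row - ((col + 1) % 2) - 1 + num * (col - 1);
}

/* Hypothetical helper: returns 1 if the red and black sweeps together touch
   each of the num*num global cells exactly once, 0 otherwise.
   Assumes num is even, as in the 512 and 1024 cases. */
int check_red_black_coverage(int num) {
    int size = num * num;
    int *hits = calloc((size_t)size, sizeof *hits);
    if (hits == NULL) {
        return 0;
    }
    /* Same loop bounds as the solver loops. */
    for (int col = 1; col < num + 1; ++col) {
        for (int row = 1; row < (num / 2) + 1; ++row) {
            ++hits[red_index(row, col, num)];
            ++hits[black_index(row, col, num)];
        }
    }
    int ok = 1;
    for (int i = 0; i < size; ++i) {
        if (hits[i] != 1) {
            ok = 0;
        }
    }
    free(hits);
    return ok;
}
```

Calling `check_red_black_coverage(NUM)` from a small `main` is enough to confirm that, within a column, consecutive `row` values map to global indices that are 2 apart, so the coefficient arrays (`aP`, `b`, and so on) are read with stride 2 in each sweep.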
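OK, when I increase the problem size to 1024x1024 (`NUM = 1024`), the results are more favorable: the OpenMP version with 4 CPU threads finishes in about 71 seconds, while the OpenACC version takes about 50. Still, this is much slower than CUDA (about 22 seconds).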
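One difference between the OpenMP and OpenACC pragmas above is that only the OpenMP one names `norm_L2` in a `reduction` clause. The `kernels` regions leave it to the compiler to recognize the sum. Below is a minimal sketch of a variation that states the reduction explicitly with `parallel loop`. It is not my solver: `sweep_with_reduction` is a hypothetical, simplified relaxation over a flat array, `Real` is assumed to be `double`, and whether this changes the timings is untested.

```c
typedef double Real; /* assumption: Real is double */

/* Hypothetical, simplified sweep: relaxes u toward rhs with factor omega
   and returns the sum of squared changes. The reduction on norm_L2 is
   declared explicitly rather than left for the compiler to detect.
   Without an OpenACC compiler the pragma is ignored and the loop runs
   serially with the same result. */
Real sweep_with_reduction(Real *u, const Real *rhs, int ncol, int nrow, Real omega)
{
    Real norm_L2 = 0.0;
    int n = ncol * nrow;
    /* Inside an enclosing data region, present(...) would replace the
       copy/copyin clauses used here to keep the sketch standalone. */
    #pragma acc parallel loop collapse(2) reduction(+:norm_L2) copy(u[0:n]) copyin(rhs[0:n])
    for (int col = 0; col < ncol; ++col) {
        for (int row = 0; row < nrow; ++row) {
            int ind = col * nrow + row;
            Real old = u[ind];
            u[ind] = old * (1.0 - omega) + omega * rhs[ind];
            Real diff = u[ind] - old;
            norm_L2 += diff * diff;
        }
    }
    return norm_L2;
}
```

The same clause could be attached to the red and black loops, keeping their `present(...)` lists. Compiler feedback (for example the accelerator messages PGI prints with `-Minfo=accel`) would show how the loops are actually scheduled in each case.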