OpenACC red-black Gauss-Seidel slower than CPU

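I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated plate problem), but the GPU-accelerated code is no faster than the CPU version, even for large problems.

I also wrote a CUDA version, which is much faster than both (for 512x512, about 2 seconds, compared with 25 seconds for both the CPU and OpenACC versions).

Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should still give better performance than the CPU (as it does, for example, for the Jacobi solver applied to the same kind of problem, demonstrated here).

Here is the relevant code (the full working source is here):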

#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp]) 
// red-black Gauss-Seidel with SOR iteration loop 
for (iter = 1; iter <= it_max; ++iter) { 
    Real norm_L2 = 0.0; 

    // update red cells 
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
     reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp]) 
    #pragma acc loop independent gang vector(4) 
    for (int col = 1; col < NUM + 1; ++col) { 
    #pragma acc loop independent gang vector(64) 
    for (int row = 1; row < (NUM/2) + 1; ++row) { 

     int ind_red = col * ((NUM/2) + 2) + row;  // local (red) index 
     int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1); // global index 

     #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind]) 

     Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM/2) + 2)] 
         + aE[ind] * temp_black[row + (col + 1) * ((NUM/2) + 2)] 
         + aS[ind] * temp_black[row - (col % 2) + col * ((NUM/2) + 2)] 
         + aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM/2) + 2)]); 

     Real temp_old = temp_red[ind_red]; 
     temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res/aP[ind]); 

     // calculate residual 
     res = temp_red[ind_red] - temp_old; 
     norm_L2 += (res * res); 

    } // end for row 
    } // end for col 

    // update black cells 
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
      reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp]) 
    #pragma acc loop independent gang vector(4) 
    for (int col = 1; col < NUM + 1; ++col) { 
    #pragma acc loop independent gang vector(64) 
    for (int row = 1; row < (NUM/2) + 1; ++row) { 

     int ind_black = col * ((NUM/2) + 2) + row;  // local (black) index 
     int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1); // global index 

     #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind]) 

     Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM/2) + 2)] 
         + aE[ind] * temp_red[row + (col + 1) * ((NUM/2) + 2)] 
         + aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM/2) + 2)] 
         + aN[ind] * temp_red[row + (col % 2) + col * ((NUM/2) + 2)]); 

     Real temp_old = temp_black[ind_black]; 
     temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res/aP[ind]); 

     // calculate residual 
     res = temp_black[ind_black] - temp_old;  
     norm_L2 += (res * res); 

    } // end for row 
    } // end for col 

    // calculate residual 
    norm_L2 = sqrt(norm_L2/((Real)size)); 

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2); 

    // if tolerance has been reached, end SOR iterations 
    if (norm_L2 < tol) { 
    break; 
    } 
}
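For anyone tracing the index arithmetic in the two update loops, here is a small C sketch that copies the global-index formulas into helper functions and checks that, for an even NUM, the red and black cells together cover the NUM x NUM grid exactly once. The helper names (`red_global_index`, `black_global_index`, `local_index`, `colors_cover_grid_once`) are illustrative only and do not appear in the original source.

```c
#include <assert.h>
#include <stdlib.h>

/* Global index of the red cell stored at (row, col) in the compact red
 * array; the formula is copied from the red update loop. */
int red_global_index(int row, int col, int num) {
    return 2 * row - (col % 2) - 1 + num * (col - 1);
}

/* Same for the black cells; formula copied from the black update loop. */
int black_global_index(int row, int col, int num) {
    return 2 * row - ((col + 1) % 2) - 1 + num * (col - 1);
}

/* Local index into a compact (red or black) array, which stores num/2
 * interior cells per column plus one extra cell at each end. */
int local_index(int row, int col, int num) {
    return col * ((num / 2) + 2) + row;
}

/* Returns 1 if every cell of the num x num grid is addressed exactly once
 * by the red and black global indices taken together, 0 otherwise. */
int colors_cover_grid_once(int num) {
    assert(num > 0 && num % 2 == 0);  /* the loops assume an even NUM */
    int size = num * num;
    int *hits = calloc((size_t)size, sizeof *hits);
    if (hits == NULL) return 0;

    for (int col = 1; col < num + 1; ++col) {
        for (int row = 1; row < (num / 2) + 1; ++row) {
            hits[red_global_index(row, col, num)]++;
            hits[black_global_index(row, col, num)]++;
        }
    }

    int ok = 1;
    for (int i = 0; i < size; ++i) {
        if (hits[i] != 1) ok = 0;
    }
    free(hits);
    return ok;
}
```

Calling `colors_cover_grid_once(NUM)` once before the solve is a cheap sanity check when changing the grid layout; it says nothing about performance.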

Source

2012-10-19 Kyle Niemeyer

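OK, when I increased the problem size to 1024x1024 (`NUM = 1024`), the results were more favorable: the OpenMP version with 4 CPU threads finished in about 71 seconds, while the OpenACC version took about 50 seconds. Still, this is much slower than CUDA (about 22 seconds).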

OK, I found a partial solution that significantly reduces the time for smaller problems.

If I insert the lines:

acc_init(acc_device_nvidia); 
acc_set_device_num(0, acc_device_nvidia);

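before starting my timer, so that the GPU is activated and selected ahead of time, the time for the 512x512 problem drops to 9.8 seconds, and the time for 1024x1024 drops to 42 seconds. Increasing the problem size further shows how much faster even the OpenACC version can be compared with running on four CPU cores.

With this change, the OpenACC code is about 2x slower than the CUDA code, and the gap narrows (to about 1.2x) as the problem size grows.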
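A minimal sketch of that timing arrangement in C, assuming a C11 compiler: the device start-up calls run before the first timestamp is taken, so their cost stays outside the measured interval. The names `wall_seconds`, `warm_up_device`, `time_solver`, and `example_solver` are hypothetical and stand in for whatever timer and solver the real code uses; only `acc_init` and `acc_set_device_num` come from the post. The OpenACC calls are guarded by `_OPENACC`, so the file also builds with a plain C compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Wall-clock time in seconds, using C11 timespec_get. */
double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Pay the one-time device start-up cost here, outside the timed region.
 * Without an OpenACC compiler this function does nothing. */
void warm_up_device(void) {
#ifdef _OPENACC
    acc_init(acc_device_nvidia);
    acc_set_device_num(0, acc_device_nvidia);
#endif
}

/* Placeholder workload standing in for the SOR iteration loop. */
void example_solver(void) {
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        s += 1e-6 * (double)i;
    }
}

/* Times solver() with device initialization excluded from the result. */
double time_solver(void (*solver)(void)) {
    assert(solver != NULL);
    warm_up_device();            /* start-up happens before the timer */
    double t0 = wall_seconds();  /* timer starts only now */
    solver();
    return wall_seconds() - t0;
}
```

Usage would be along the lines of `printf("Total time: %f s\n", time_solver(example_solver));`, with the real solver passed in place of the placeholder.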

Source

2012-10-19 20:04:59

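acc_init takes 1 to 2 seconds. How long do the 512x512 and 1024x1024 problems take to run on the CPU? – ahmad

As I mentioned, for the 512x512 problem, **without** acc_init before starting the timer, the OpenACC code runs in about 25 seconds. With acc_init placed before the timer starts, that drops to about 9.8 seconds. For 1024x1024, the time goes from about 50 seconds to 42 seconds. I have read elsewhere that acc_init should take only a few seconds, but the effect on my code (for the smaller problems) seems to be larger than that.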

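I downloaded the full code, then compiled and ran it. It did not stop running, and for the statement

if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

the output was:

100, nan
200, nan
....

I changed all variables of type Real to float, and the output was: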

100, 0.000654
200, 0.000370
...
8800, 0.000002
8900, 0.000002
9000, 0.000001
9100, 0.000001
9200, 0.000001
9300, 0.000001
9400, 0.000001
9500, 0.000001
9600, 0.000001
9700, 0.000001

CPU

Iterations: 9796

Total time: 5.594017 s

With NUM = 1024 the result was:

Iterations: 27271

Total time: 25.949905 s
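One note on the nan run reported above: any comparison involving NaN is false, so a test like `norm_L2 < tol` can never succeed once the residual is NaN, and the loop keeps going until `it_max`. The C sketch below shows one way to guard against that. It is only an illustration, not part of the original code, and the names `sor_status` and `check_residual` are made up for it.

```c
#include <assert.h>
#include <math.h>

/* Possible outcomes of the per-iteration residual check. */
typedef enum { SOR_CONTINUE, SOR_CONVERGED, SOR_DIVERGED } sor_status;

/* Classifies the residual norm of the current iteration.
 * A NaN or infinite residual is reported as divergence, so the caller
 * can break out of the loop instead of iterating until it_max. */
sor_status check_residual(double norm_L2, double tol) {
    assert(tol > 0.0);
    if (!isfinite(norm_L2)) {
        return SOR_DIVERGED;   /* NaN < tol is always false */
    }
    if (norm_L2 < tol) {
        return SOR_CONVERGED;
    }
    return SOR_CONTINUE;
}
```

Inside the iteration loop, the existing `if (norm_L2 < tol) break;` could be replaced by a break whenever `check_residual(norm_L2, tol) != SOR_CONTINUE`, with a message printed in the diverged case.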

Source

2012-11-25 10:38:39 pg1927
