Why is this statement in a CUDA kernel slow?

I am using CUDA for some computer vision work. The following code takes about 20 seconds to complete.

__global__ void nlmcuda_kernel(float* fpOMul, /*other input args*/) {

    float fpODenoised[75];

    /* Do awesome stuff to compute fpODenoised */

    // Inside nested loops (this is the statement that is the bottleneck in the code):
    fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = fpODenoised[ii * iwl + iindex];

}

If I replace that statement with

fpOMul[ii * iwl * iwxh + iindex * iwxh + il] = 2.0f;

the code completes in just a few seconds.

Why is that statement slow, and how can I make it run fast?

Source

2013-09-26 Aashish Thite

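When you change the code, the compiler sees that all of your "awesome stuff" is no longer needed and optimizes it away: once the store writes a constant, fpODenoised is never read, so the work that computes it becomes dead code. The statement you modified is not the direct cause of the performance difference. You can verify this by looking at the PTX or SASS code in each case.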
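The effect can be reproduced with a small stand-alone sketch. Everything in it is an assumption made for illustration and is not the asker's code: the kernel names store_computed and store_constant, the helper flat_index, the placeholder workload in compute_denoised, and the sizes (3 x 25 = 75 values per thread, iwxh = 4096). The two kernels differ only in the value they store, so any timing gap comes from what the compiler keeps, not from the store itself. The exact result depends on the compiler version and flags; building with -G (device debug) disables most optimizations and would likely hide the difference.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kChannels = 3;             // hypothetical split: 3 * 25 = 75
constexpr int kIwl = 25;
constexpr int kLocal = kChannels * kIwl; // per-thread array size, 75 as in the question

// Flattened index with the same shape as the expression in the question.
__host__ __device__ inline int flat_index(int ii, int iindex, int il, int iwl, int iwxh) {
    return ii * iwl * iwxh + iindex * iwxh + il;
}

// Placeholder for the expensive computation (hypothetical workload).
__device__ void compute_denoised(float* local, int il) {
    for (int k = 0; k < kLocal; ++k) {
        float acc = 0.0f;
        for (int j = 0; j < 2000; ++j)
            acc += sinf(0.001f * (float)(j + k + il));
        local[k] = acc;
    }
}

// Variant A: the computed values reach global memory, so the work must be kept.
__global__ void store_computed(float* out, int channels, int iwl, int iwxh) {
    int il = blockIdx.x * blockDim.x + threadIdx.x;
    if (il >= iwxh) return;
    float local[kLocal];
    compute_denoised(local, il);
    for (int ii = 0; ii < channels; ++ii)
        for (int iindex = 0; iindex < iwl; ++iindex)
            out[flat_index(ii, iindex, il, iwl, iwxh)] = local[ii * iwl + iindex];
}

// Variant B: only a constant is stored; `local` is never read, so the
// compiler is free to delete compute_denoised as dead code.
__global__ void store_constant(float* out, int channels, int iwl, int iwxh) {
    int il = blockIdx.x * blockDim.x + threadIdx.x;
    if (il >= iwxh) return;
    float local[kLocal];
    compute_denoised(local, il);
    for (int ii = 0; ii < channels; ++ii)
        for (int iindex = 0; iindex < iwl; ++iindex)
            out[flat_index(ii, iindex, il, iwl, iwxh)] = 2.0f;
}

using Kernel = void (*)(float*, int, int, int);

// Times one kernel launch with CUDA events and returns milliseconds.
float time_kernel(Kernel kernel, float* d_out, int iwxh) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int block = 128;
    int grid = (iwxh + block - 1) / block;
    cudaEventRecord(start);
    kernel<<<grid, block>>>(d_out, kChannels, kIwl, iwxh);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int iwxh = 4096;
    float* d_out = nullptr;
    size_t bytes = sizeof(float) * kLocal * iwxh;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    printf("store computed: %.3f ms\n", time_kernel(store_computed, d_out, iwxh));
    printf("store constant: %.3f ms\n", time_kernel(store_constant, d_out, iwxh));
    cudaFree(d_out);
    return 0;
}
```

To inspect the generated code rather than rely on timing, assuming the file is saved as demo.cu (a hypothetical name), `nvcc -O3 -ptx demo.cu -o demo.ptx` emits the PTX, and `cuobjdump -sass demo` dumps the SASS from a compiled binary. If the optimizer has removed the computation, the store_constant kernel should be visibly shorter and contain no trace of the sine loop.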

Source

2013-09-26 18:42:19
