Problem with CUDA streams
I am running CUBLAS v2.0 on different streams on a single GPU (Tesla C2050) by subdividing the input matrices (A[x/num_of_streams * y] B[x y] = C[x/num_of_streams * y]), but somehow it takes more time when I use CUDA streams. Here is the code snippet:
//plan is a struct containing the matrix dimensions and stream numbers
//should run in parallel across nstreams - at most 16 streams can run concurrently
//Copy A - cudaMemCpyAsync
for(i = 0; i < nstreams; i++)
cudgemm_copyA_in_streams (&plan[i]);
//Copy B - cudaMemCpyAsync
for(i = 0; i < nstreams; i++)
cudgemm_copyB_in_streams (&plan[i]);
//Create handles - serial
for(i = 0; i < nstreams; i++)
handle[i] = create_handle();
//Run kernels - first doing a cublasSetStream(handle, plan->stream) before running cublasDgemm...
for(i = 0; i < nstreams; i++)
cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);
//Destroy handles - serial
for(i = 0; i < nstreams; i++)
destroy_handle (handle[i]);
//Copy C - cudaMemCpyAsync
for(i = 0; i < nstreams; i++)
cudgemm_copyC_in_streams (&plan[i]);
//EDIT: Function body
//The other two copy functions are exactly the same as this
void cudgemm_copyA_in_streams(TGPUplan *plan)
{
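// NOTE: plan->Ah_Data must be page-locked host memory (cudaHostAlloc /
// cudaMallocHost) for this cudaMemcpyAsync to overlap with kernels and with
// copies in other streams; with pageable memory the transfer is staged and
// does not overlap.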
cudasafe(cudaMemcpyAsync(plan->Ad_Data, plan->Ah_Data, (plan->Acols * plan->Arows * sizeof(double)), cudaMemcpyHostToDevice, plan->stream));
}
//Create handle
cublasHandle_t create_handle()
{
cublasHandle_t handle;
checkError(cublasCreate(&handle), "cublasCreate() error!\n");
return handle;
}
//Destroy handle
void destroy_handle (cublasHandle_t handle)
{
checkError(cublasDestroy(handle), "cublasDestroy() error!\n");
}
//Kernel
void cudgemm_kernel_in_streams(TGPUplan *plan, cublasHandle_t handle, const double alpha, const double beta)
{
cublasStatus_t ret;
cublasSetStream(handle, plan->stream);
ret = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, plan->Arows, plan->Ccols, plan->Acols, &alpha, plan->Ad_Data, plan->Arows, plan->Bd_Data, plan->Brows, &beta, plan->Cd_Data, plan->Crows);
checkError(ret, "cublas Dgemm returned an error!\n");
}
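For completeness, here is a minimal sketch of the kind of per-stream setup the snippets above assume but do not show. The TGPUplan field names are taken from the code; the struct layout, the dimension parameters and the setup_plan helper are my assumptions, and error checking is omitted. The important detail is that the host buffers are allocated with cudaHostAlloc, since the asynchronous copies only overlap when the host memory is page-locked.
#include <cuda_runtime.h>
// Assumed layout of the plan struct, with field names matching the snippets above.
typedef struct
{
    double *Ah_Data, *Bh_Data, *Ch_Data;   // pinned host buffers
    double *Ad_Data, *Bd_Data, *Cd_Data;   // device buffers
    int Arows, Acols, Brows, Bcols, Crows, Ccols;
    cudaStream_t stream;
} TGPUplan;
// Hypothetical setup for one plan: C (m x n) = A (m x k) * B (k x n),
// matching the (Arows, Ccols, Acols) arguments of the cublasDgemm call above.
void setup_plan(TGPUplan *plan, int m, int n, int k)
{
    plan->Arows = m; plan->Acols = k;
    plan->Brows = k; plan->Bcols = n;
    plan->Crows = m; plan->Ccols = n;
    cudaStreamCreate(&plan->stream);
    // Page-locked host memory, required for cudaMemcpyAsync to be asynchronous.
    cudaHostAlloc((void **)&plan->Ah_Data, m * k * sizeof(double), cudaHostAllocDefault);
    cudaHostAlloc((void **)&plan->Bh_Data, k * n * sizeof(double), cudaHostAllocDefault);
    cudaHostAlloc((void **)&plan->Ch_Data, m * n * sizeof(double), cudaHostAllocDefault);
    cudaMalloc((void **)&plan->Ad_Data, m * k * sizeof(double));
    cudaMalloc((void **)&plan->Bd_Data, k * n * sizeof(double));
    cudaMalloc((void **)&plan->Cd_Data, m * n * sizeof(double));
}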
So I am bouncing back and forth between streams, assigning work to each, expecting a better execution time, but I notice that the more streams I use, the more time the program takes compared with the version that does not use streams. Where am I going wrong? Cross-posted to the Nvidia forums -
Edit:
I modified my program as follows:
//copy data
for(i = 0; i < nstreams; i++)
{
cudgemm_copyA_in_streams (&plan[i]);
cudgemm_copyB_in_streams (&plan[i]);
}
//Run kernel and copy back
for(i = 0; i < nstreams; i++)
{
cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);
cudgemm_copyC_in_streams (&plan[i]);
}
When I profile my program for a matrix of order 6144, I get the following information:
Kernel time = 42.75 % of total GPU time
Memory copy time = 28.9 % of total GPU time
Kernel taking maximum time = fermiDgemm_v2_kernel_val (42.8% of total GPU time)
Memory copy taking maximum time = memcpyHtoDasync (21.7% of total GPU time)
Total overlap time in GPU = 65268.3 micro sec. (3.6% of total GPU time)
When I time the loops above, I get a time of 0.000284 s, versus 1.703289 s for the version that does not use streams (in that version I also time the two sequential memory copies, the kernel invocation, and the remaining memcpy). I suppose that since I am not using any synchronization constructs, I may be printing the time before the computation has actually finished (I find it hard to believe there is a 100% improvement).
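One way to check that suspicion is to synchronize every stream before stopping the timer. The sketch below reuses the helpers, plan[] and handle[] arrays from the question and wraps the loops in CUDA events; any host timer would do equally well once the synchronization is in place.
// Timed version of the loops above: the asynchronous copies and kernels are
// only included in the measured interval because every stream is synchronized
// before the stop event is recorded.
cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
for (i = 0; i < nstreams; i++)
{
    cudgemm_copyA_in_streams(&plan[i]);
    cudgemm_copyB_in_streams(&plan[i]);
}
for (i = 0; i < nstreams; i++)
{
    cudgemm_kernel_in_streams(&plan[i], handle[i], 1.0, 1.0);
    cudgemm_copyC_in_streams(&plan[i]);
}
// Wait for all queued work in every stream to finish before timing ends.
for (i = 0; i < nstreams; i++)
    cudaStreamSynchronize(plan[i].stream);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_ms, start, stop);   // result in milliseconds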
There is too much abstraction in that code to say why, but I would *guess* it is the memory copies. Your device has 2 DMA engines; it can overlap kernel execution with asynchronous memory transfers on at most 2 streams, or perform a single bidirectional direct transfer. Blindly queuing up 16 transfers is not a recipe for performance. Could you post the code of one of your copy methods? – talonmies
I haven't gone up to 16 streams yet; I am testing with 2, 4, and 8 streams. Thanks for reminding me about the number of engines... but the third copy takes effect after the kernel execution, which happens after the first two copies are done, so the DMA engines should be free when I copy C? – Sayan
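As an aside, the number of copy (DMA) engines mentioned in the comment above can be queried at runtime; a small standalone sketch, independent of the question's code:
#include <stdio.h>
#include <cuda_runtime.h>
int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    // asyncEngineCount tells how many host<->device copies can overlap with
    // kernel execution; 2 means copies in both directions can overlap.
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}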