Are cudaMalloc and cudaFree synchronous or asynchronous calls? I wanted to test this, so I made some modifications to the "simpleMultiGPU.cu" sample code from the CUDA SDK. Below is the part I changed (the added lines are not indented):
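//Added: per-GPU dummy pointers used only to probe cudaMalloc/cudaFree behaviour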
float *dd[GPU_N];
for (i = 0; i < GPU_N; i++){cudaSetDevice(i); cudaMalloc((void**)&dd[i], sizeof(float));}
//Start timing and compute on GPU(s)
printf("Computing with %d GPUs...\n", GPU_N);
StartTimer();
//Copy data to GPU, launch the kernel and copy data back. All asynchronously
for (i = 0; i < GPU_N; i++)
{
//Set device
checkCudaErrors(cudaSetDevice(i));
//Copy input data from CPU
checkCudaErrors(cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data, plan[i].dataN * sizeof(float), cudaMemcpyHostToDevice, plan[i].stream));
//Perform GPU computations
reduceKernel<<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);
getLastCudaError("reduceKernel() execution failed.\n");
//Read back GPU results
checkCudaErrors(cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum, ACCUM_N *sizeof(float), cudaMemcpyDeviceToHost, plan[i].stream));
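//Added: check whether these two calls return immediately or block on the asynchronous work queued above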
cudaMalloc((void**)&dd[i],sizeof(float));
cudaFree(dd[i]);
//cudaStreamSynchronize(plan[i].stream);
}
By commenting out the cudaMalloc line and the cudaFree line in the main loop one at a time, I found that on a 2-GPU system the measured GPU processing time was 30 ms and 20 ms respectively, so I concluded that cudaMalloc is an asynchronous call while cudaFree is a synchronous call. I am not sure whether this is true, nor what the design consideration in the CUDA architecture is. My device has compute capability 2.0, and I tried both CUDA 4.0 and CUDA 5.0.
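A more direct way to check this, independent of the multi-GPU sample, is to keep the GPU busy with a long-running kernel and then time how long cudaMalloc and cudaFree take on the host: a call that only returns after the kernel has finished is effectively synchronizing with the device. Below is a minimal sketch of my own (not from the sample); spinKernel and the cycle count are hypothetical, and it assumes a toolchain whose host compiler supports C++11 (std::chrono).

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

//Busy-wait kernel that keeps the GPU occupied for roughly the given number of clock cycles
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    float *d_a = NULL, *d_b = NULL;
    cudaMalloc((void**)&d_a, sizeof(float));   //pre-allocate so cudaFree has something to release

    spinKernel<<<1, 1>>>(2000000000LL);        //asynchronous launch; the GPU stays busy for a while

    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    cudaMalloc((void**)&d_b, sizeof(float));   //time this call while the kernel is still running
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    cudaFree(d_a);                             //time this call while the kernel is still running
    std::chrono::steady_clock::time_point t2 = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();                   //wait for the kernel before exiting
    cudaFree(d_b);

    printf("cudaMalloc: %lld us, cudaFree: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    return 0;
}

If cudaFree consistently takes about as long as the kernel while cudaMalloc returns almost immediately, that would support the interpretation drawn from the 30 ms vs. 20 ms difference above; the exact behaviour may still vary between driver and runtime versions, so I would not treat one measurement as definitive.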