2016-10-26 42 views
3

(問題標題:GPU與CPU編程:處理時間不一致)

我目前正在進行圖像跟蹤:藉助相機,我跟蹤與Android系統交互的手指觸摸。圖像處理使用OpenCL在GPU上完成:我將相機輸出轉換爲黑白幀,以便得到白色的點。這種方法的處理時間是65ms。 由於我的目標是使程序更加流暢,我改用OpenCV方法在CPU上執行了相同的操作,得到的處理時間是115ms。問題在於,程序在使用OpenCV方法時反而感覺反應更快,我不明白在這種情況下測得的處理時間怎麼會更長:這對我來說似乎自相矛盾。 測量方法如下:

start= clock(); 
finish = clock(); 
double time =((double)finish -start)/CLOCKS_PER_SEC; 
std::cout<<"process time : "<< time<<std::endl; 

這裏是我的代碼:

static cv::Mat    original_Right,binary_Right; 
static cv::Mat    original_Left, binary_Left; 
int     width, height; 
clock_t     start,finish; 
double time = 0.0; 

width = (int) this->camera_Right.getCapture().get(cv::CAP_PROP_FRAME_WIDTH); 
height = (int) this->camera_Right.getCapture().get(cv::CAP_PROP_FRAME_HEIGHT); 
original_Right.create(height, width, CV_8UC3); 


//--------------------------- Camera 2 --------------------------------- 
int width_2 = (int) this->camera_Left.getCapture().get(cv::CAP_PROP_FRAME_WIDTH); 
int height_2 = (int) this->camera_Left.getCapture().get(cv::CAP_PROP_FRAME_HEIGHT); 
original_Left.create(height_2, width_2, CV_8UC3); 


binary_Right.create(height, width, CV_32F); // FOR GPU 
binary_Left.create(height_2, width_2, CV_32F); // FOR GPU 
//binary_Right.create(height, width, CV_8UC1); // FOR CPU 
//binary_Left.create(height_2, width_2, CV_8UC1); // FOR CPU 

Core::running_ = true; 


//------------------------------------ SET UP THE GPU ----------------------------------------- 
cl_context    context; 
cl_context_properties properties [3]; 
cl_kernel    kernel; 
cl_command_queue  command_queue; 
cl_program    program; 
cl_int     err; 
cl_uint     num_of_platforms=0; 
cl_platform_id   platform_id; 
cl_device_id   device_id; 
cl_uint     num_of_devices=0; 
cl_mem     input, output; 

size_t     global; 

int      data_size =height*width*3; 


//load opencl source 
FILE *fp; 
char fileName[] = "./helloTedKrissV2.cl"; 
char *source_str; 

//Load the source code containing the kernel 
fp = fopen(fileName, "r"); 
if (!fp) { 
fprintf(stderr, "Failed to load kernel.\n"); 
exit(1); 
} 
source_str = (char*)malloc(MAX_SOURCE_SIZE); 
global = fread(source_str, 1, MAX_SOURCE_SIZE, fp); 
fclose(fp); 


//retreives a list of platforms available 
if(clGetPlatformIDs(1,&platform_id, &num_of_platforms)!=CL_SUCCESS){ 
    std::cout<<"unable to get a platform_id"<<std::endl; 
}; 

// to get a supported GPU device 
if(clGetDeviceIDs(platform_id,CL_DEVICE_TYPE_GPU,1,&device_id, &num_of_devices)!= CL_SUCCESS){ 
    std::cout<<"unable to get a device_id"<<std::endl;  
}; 

//context properties list - must be terminated with 0 
properties[0]=CL_CONTEXT_PLATFORM; 
properties[1]=(cl_context_properties) platform_id; 
properties[2]=0; 

// create a context with the gpu device 
context = clCreateContext(properties,1,&device_id,NULL,NULL,&err); 

//create command queue using the context and device 
command_queue = clCreateCommandQueue(context,device_id,0,&err); 

//create a program from the kernel source code 
program= clCreateProgramWithSource(context,1,(const char **) &source_str, NULL,&err); 

// compile the program 
if(clBuildProgram(program,0,NULL,NULL,NULL,NULL)!=CL_SUCCESS){ 
    size_t length; 
    std::cout<<"Error building program"<<std::endl; 
    char buffer[4096]; 
    clGetProgramBuildInfo(program,device_id,CL_PROGRAM_BUILD_LOG, sizeof(buffer),buffer,&length); 
    std::cout<< buffer <<std::endl; 
} 

//specify which kernel from the program to execute 
kernel = clCreateKernel(program,"imageProcessing",&err); 




while (this->isRunning() == true) { 

    start= clock(); //--------------------- START---------------------- 

    //----------------------FRAME--------------------- 
    this->camera_Right.readFrame(original_Right); 
    if (original_Right.empty() == true) { 
     std::cerr << "[Core/Error] Original frame is empty." << std::endl; 
     break; 
    } 

    this->camera_Left.readFrame(original_Left); 
    if (original_Left.empty() == true) { 
     std::cerr << "[Core/Error] Original 2 frame is empty." << std::endl; 
     break; 
    } 
    //----------------------FRAME--------------------- 



    //------------------------------------------------IMP GPU ------------------------------------------------------ 

    input = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR , sizeof(unsigned char)*data_size,NULL,NULL); 
    output =clCreateBuffer(context,CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(float)*data_size/3,NULL,NULL); 

    if(clEnqueueWriteBuffer(command_queue,input,CL_TRUE,0,sizeof(unsigned char)*data_size, original_Right.data ,0,NULL,NULL)!= CL_SUCCESS){}; 

    //set the argument list for the kernel command 
    clSetKernelArg(kernel,0,sizeof(cl_mem), &input); 
    clSetKernelArg(kernel,1,sizeof(cl_mem), &output); 
    global = data_size ; 
    //enqueue the kernel command for execution 
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL,0,NULL,NULL); 
    clFinish(command_queue); 
    //copy the results from out of the output buffer 
    if(clEnqueueReadBuffer(command_queue,output,CL_TRUE ,0,sizeof(float)*data_size/3,binary_Right.data,0,NULL,NULL)!= CL_SUCCESS){}; 

    clReleaseMemObject(input); 
    clReleaseMemObject(output); 

    //------------------------------------------------IMP GPU ------------------------------------------------------ 

    input = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR , sizeof(unsigned char)*data_size,NULL,NULL); 
    output =clCreateBuffer(context,CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(float)*data_size/3,NULL,NULL); 

    if(clEnqueueWriteBuffer(command_queue,input,CL_TRUE,0,sizeof(unsigned char)*data_size, original_Left.data ,0,NULL,NULL)!= CL_SUCCESS){}; 

    //set the argument list for the kernel command 
    clSetKernelArg(kernel,0,sizeof(cl_mem), &input); 
    clSetKernelArg(kernel,1,sizeof(cl_mem), &output); 
    global = data_size ; 
    //enqueue the kernel command for execution 
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL,0,NULL,NULL); 
    clFinish(command_queue); 
    //copy the results from out of the output buffer 
    if(clEnqueueReadBuffer(command_queue,output,CL_TRUE ,0,sizeof(float)*data_size/3,binary_Left.data,0,NULL,NULL)!= CL_SUCCESS){}; 

    clReleaseMemObject(input); 
    clReleaseMemObject(output); 

    //------------------------------------------------IMP GPU ------------------------------------------------------ 

    // CPU METHOD 
    // adok::processing::doImageProcessing(original_Right, binary_Right); 
    // adok::processing::doImageProcessing(original_Left, binary_Left); 

    //-------------------------------------------------------------- TRACKING ------------------------------------------------------ 

adok::tracking::doFingerContoursTracking(binary_Right,binary_Left, this->fingerContours, this->perspective_Right,this->perspective_Left, this->distortion_Right,this->distortion_Left, this); 

    //------------------------------------------- TRACKING ----------------------------------------- 

//------------------------------SEND COORDINATES TO ANDROID BOARD-------------------- 
if (getSideRight() && !getSideLeft()) { 
     std::cout<<"RIGHT : "<<std::endl; 
     this->uart_.sendAll(this->fingerContours, this->perspective_Right.getPerspectiveMatrix(), RIGHT); 
    }else if (!getSideRight() && getSideLeft()){ 
     std::cout<<"LEFT : "<<std::endl; 
     this->uart_.sendAll(this->fingerContours, this->perspective_Left.getPerspectiveMatrix(), LEFT); 
    }else if (getSideRight() && getSideLeft()){ 
     std::cout<<"RIGHT & LEFT : "<<std::endl; 
     this->uart_.sendAll(this->fingerContours, this->perspective_Right.getPerspectiveMatrix(), this->perspective_Left.getPerspectiveMatrix()); 

    } 

this->setSideRight(0); 
this->setSideLeft(0); 

finish = clock(); 
time =(double)(finish - start)/CLOCKS_PER_SEC; 
std::cout << "Time: " << time << std::endl; // ------------END----------- 

} 
clReleaseCommandQueue(command_queue); 
clReleaseProgram(program); 
clReleaseKernel(kernel); 
clReleaseContext(context); 
this->stop(); 

}

也有一些奇怪的事情,當我在CPU的抓取時間一幀是5ms,而在GPU上是15ms,我不知道爲什麼它會增加。

而我正在研究android xu4。

回答

0

謝謝你的回答!我找到了程序爲什麼感覺卡頓的原因:抓取一幀的時間從5ms增加到了15ms。這可能是因爲我每一幀都重新創建緩衝區,佔用了帶寬。GPU上的處理本身比CPU快,但這樣做會影響每秒的幀數。 而造成這種情況的原因,是我把下面這段操作做了兩次(每個攝像機各一次):

if(clEnqueueWriteBuffer(command_queue,input,CL_TRUE,0,sizeof(unsigned char)*data_size, original_Right.data ,0,NULL,NULL)!= CL_SUCCESS){}; 
    //set the argument list for the kernel command 
    clSetKernelArg(kernel,0,sizeof(cl_mem), &input); 
    clSetKernelArg(kernel,1,sizeof(cl_mem), &output); 
    global = data_size ; 
    //enqueue the kernel command for execution 
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL,0,NULL,NULL); 
    clFinish(command_queue); 
    //copy the results from out of the output buffer 
    if(clEnqueueReadBuffer(command_queue,output,CL_TRUE ,0,sizeof(unsigned char)*data_size,original_Right.data,0,NULL,NULL)!= CL_SUCCESS){}; 
3

GPU計算有時它可能比CPU計算需要很多時間。因爲,對於GPU計算,主進程將數據發送到GPU內存,經過數學計算後GPU將數據發送回CPU。所以,數據傳輸和接收回到CPU需要時間。如果計算得出的緩衝區大小較大且傳輸時間較長,則可能需要更多時間進行計算。 CUDNN庫與GPU處理器使它快許多倍。所以,如果你的程序不使用CUDNN它可能會更慢。

+0

我使用的是OpenCL,所以我想我不能使用CUDNN庫。但這並不能解釋爲什麼我測量到的時間更短,程序卻真的感覺卡頓。 –

+0

什麼是你的框架尺寸? –

+0

看到這個http://opencv-users.1802565.n2.nabble.com/Poor-OpenCL-performance-td7584466.html –

0

您可以嘗試使用事件來查看寫入數據需要多長時間以及處理數據需要多長時間。
並且在一般情況下使用clFinish不是一個好主意。當您從Enqueue命令中獲取事件並將其傳遞給Read Data時,讀取的數據將在處理完成時立即發生。 另一個問題是,您不必每次都重新創建緩衝區對象,只要您具有相同的數據大小,您可以創建一次並保持重用。