2012-10-28

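OpenCL: INVALID_COMMAND_QUEUE on a device-to-host memory transfer after calling a kernel multiple times in a loop

I am developing an OpenCL program that calls the same kernel many times in a loop. When I use clEnqueueReadBuffer to transfer the device memory back to the host, it reports that the command queue is invalid.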

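Below is the function that is called to start a bitonic sort. It has been shortened to make it more readable. The device, context, command queue, and kernel are created elsewhere and passed to this function. list contains the list to be sorted, and size is the number of elements in the list.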

cl_int OpenCLBitonicSort(cl_device_id device, cl_context context, 
    cl_command_queue commandQueue, cl_kernel bitonicSortKernel, 
    unsigned int * list, unsigned int size){ 

    //create OpenCL specific variables 
    cl_int error = CL_SUCCESS; 
    size_t maximum_local_ws; 
    size_t local_ws; 
    size_t global_ws; 

    //create variables that keep track of bitonic sorting progress 
    unsigned int i; 
    unsigned int stage = 0; 
    unsigned int subStage; 
    unsigned int numberOfStages = 0; 

    //get maximum work group size 
    error = clGetKernelWorkGroupInfo(bitonicSortKernel, device, 
     CL_KERNEL_WORK_GROUP_SIZE, sizeof(maximum_local_ws), 
     &maximum_local_ws, NULL); 
    if(error != CL_SUCCESS){ 
     return error; 
    } 

    //make local_ws the largest power of two allowed by OpenCL 
    local_ws = 1; 
    while(local_ws * 2 <= maximum_local_ws){ 
     local_ws *= 2; 
    } 
    //total number of comparators will be half the items in the list 
    global_ws = (size_t) size/2; 
    //the local size must not exceed the global size 
    if(local_ws > global_ws){ 
     local_ws = global_ws; 
    } 

    //transfer list to the device 
    cl_mem list_d = clCreateBuffer(context, 
     CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
     size * sizeof(unsigned int), list, &error); 
    if(error != CL_SUCCESS){ 
     return error; 
    } 

    //find the number of stages needed (numberOfStages = log2(size)); 
    //size is assumed to be a power of two 
    for(numberOfStages = 0; (1u << numberOfStages) < size; numberOfStages++){ 
    } 

    //loop through all stages 
    for(stage = 0; stage < numberOfStages; stage++){ 
     //loop through all substages in each stage 
     for(subStage = stage, i = 0; i <= stage; subStage--, i++){ 
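      //add kernel parameters, stopping at the first failure 
      error = clSetKernelArg(bitonicSortKernel, 0, 
       sizeof(cl_mem), &list_d); 
      if(error == CL_SUCCESS) error = clSetKernelArg(bitonicSortKernel, 1, 
       sizeof(unsigned int), &size); 
      if(error == CL_SUCCESS) error = clSetKernelArg(bitonicSortKernel, 2, 
       sizeof(unsigned int), &stage); 
      if(error == CL_SUCCESS) error = clSetKernelArg(bitonicSortKernel, 3, 
       sizeof(unsigned int), &subStage); 

      //call the kernel 
      if(error == CL_SUCCESS) error = clEnqueueNDRangeKernel(commandQueue, 
       bitonicSortKernel, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL); 

      //order this launch before the next one; the barrier is a 
      //synchronization point in the queue and does not block the host 
      if(error == CL_SUCCESS) error = clEnqueueBarrier(commandQueue); 

      //report the first failing call instead of overwriting its status 
      if(error != CL_SUCCESS){ 
       clReleaseMemObject(list_d); 
       return error; 
      } 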
     } 
    } 

    //read the result back to the host 
    error = clEnqueueReadBuffer(commandQueue, list_d, CL_TRUE, 0, 
     size * sizeof(unsigned int), list, 0, NULL, NULL); 

    //free the list on the device 
    clReleaseMemObject(list_d); 

    return error; 
} 

In this code, clEnqueueReadBuffer reports that commandQueue is invalid. However, the queue was valid when I called clEnqueueNDRangeKernel and clEnqueueBarrier.
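When a status code only shows up at the final read, it can help to print the status of every OpenCL call in the loop so that the first failing call stands out. Below is a minimal sketch of a lookup helper. The name `clStatusName` is hypothetical, and the numeric values are written as literals mirroring the Khronos `cl.h` header so that the snippet compiles without an OpenCL SDK; in real code, compare against the `CL_*` macros instead.

```c
/* Hypothetical helper: maps a few OpenCL status codes to their names.
   The literals mirror the values defined in the Khronos cl.h header;
   only a small subset of codes is listed here. */
static const char *clStatusName(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -52: return "CL_INVALID_KERNEL_ARGS";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "unlisted status code";
    }
}
```

Printing `clStatusName(error)` after each call inside the stage loop would show at which call the status first stops being CL_SUCCESS.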

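When I set numberOfStages to just 1 and stage to just 0, so that clEnqueueNDRangeKernel is called only once, the code runs without returning an error (although the result is incorrect). The problem appears only when clEnqueueNDRangeKernel is called multiple times, which I really need to do.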
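To see why every (stage, subStage) pass is needed, the launch schedule can be replayed on the host. The sketch below is not the kernel from the question, which is not shown. `bitonicPass` and `bitonicSortHost` are hypothetical names, and the comparator indexing is the common bitonic network layout, assumed here for illustration. Each call to `bitonicPass` stands in for one kernel launch with size/2 work-items.

```c
#include <assert.h>

/* Hypothetical host stand-in for one kernel launch: runs all size/2
   comparators of one (stage, subStage) pass. The indexing is an
   assumption, since the question does not show the kernel. */
static void bitonicPass(unsigned int *list, unsigned int size,
                        unsigned int stage, unsigned int subStage)
{
    unsigned int pairDistance = 1u << subStage;
    unsigned int sameDirectionWidth = 1u << stage;
    unsigned int id;

    for (id = 0; id < size / 2; id++) {
        unsigned int left = (id % pairDistance)
                          + (id / pairDistance) * 2u * pairDistance;
        unsigned int right = left + pairDistance;
        /* Blocks of sameDirectionWidth comparators alternate direction. */
        int ascending = ((id / sameDirectionWidth) % 2u) == 0;

        if ((list[left] > list[right]) == ascending) {
            unsigned int tmp = list[left];
            list[left] = list[right];
            list[right] = tmp;
        }
    }
}

/* Replays the stage/subStage loop from the question on the host and
   returns the number of passes (one per would-be kernel launch).
   size must be a power of two. */
static unsigned int bitonicSortHost(unsigned int *list, unsigned int size)
{
    unsigned int numberOfStages = 0, stage, i, passes = 0;

    assert(size != 0 && (size & (size - 1)) == 0);
    while ((1u << numberOfStages) < size) {
        numberOfStages++;
    }
    for (stage = 0; stage < numberOfStages; stage++) {
        for (i = 0; i <= stage; i++) {
            bitonicPass(list, size, stage, stage - i);
            passes++;
        }
    }
    return passes;
}
```

For an 8-element list this schedule performs 1 + 2 + 3 = 6 passes. Stopping after the first pass only compares adjacent pairs, which is consistent with the observation that a single launch returns an incorrect result.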

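I am on Mac OS 10.6 Snow Leopard, using Apple's OpenCL 1.0 platform with an NVidia GeForce 9600m. Is it possible to run a kernel inside a loop in OpenCL on other platforms? Has anyone had a problem like this with OpenCL on OS X? What could cause the command queue to become invalid?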
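The thread does not establish what invalidates the queue. One thing that can be ruled out on the host, without a device, is the launch geometry: in OpenCL 1.x an explicit local size must not exceed CL_KERNEL_WORK_GROUP_SIZE and must evenly divide the global size. The following is a minimal sketch under that assumption; `largestPowerOfTwo` and `chooseLocalSize` are hypothetical helper names, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power of two that does not exceed limit (limit >= 1). */
static size_t largestPowerOfTwo(size_t limit)
{
    size_t p = 1;

    assert(limit >= 1);
    while (p <= limit / 2) {
        p *= 2;
    }
    return p;
}

/* Hypothetical helper: picks a power-of-two local size that is no
   larger than maxLocalSize and evenly divides globalSize. */
static size_t chooseLocalSize(size_t globalSize, size_t maxLocalSize)
{
    size_t local;

    assert(globalSize >= 1);
    local = largestPowerOfTwo(maxLocalSize);
    while (local > 1 && globalSize % local != 0) {
        local /= 2;
    }
    return local;
}
```

With the function from the question, `chooseLocalSize(size / 2, maximum_local_ws)` would be a candidate value for local_ws; for a short list such as size = 8 it yields 4 rather than the device maximum.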

Answer

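To answer one of your questions: yes, you can enqueue any number of kernels into a command queue, whether from inside a loop or otherwise. I can confirm that this works at least on Windows with the NVIDIA, AMD, and Intel drivers.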