GPU內核中的指針和位操作符

我想使用GPU內核在一個捲上執行雙重閾值。我發送我的音量，每片，作爲read_only image2d_t。我的輸出音量是二進制音量，其中每位指定是否啓用其相關的體素。我的內核檢查當前像素值是否在低/高閾值範圍內，並在二進制卷中啓用其相應位。GPU內核中的指針和位操作符

爲了調試目的，我留下了實際檢查現在評論。我只是使用傳遞的片nr來確定二進制卷位應該打開還是關閉。前14個切片設置爲「開」，其餘切爲「關」。我也在CPU端驗證了這個代碼，我粘貼在這篇文章底部的代碼。該代碼顯示了兩條路徑，現在正在評論CPU。

的CPU代碼按預期工作，下面的圖像的渲染與二進制掩碼體積應用後返回：用我的GPU內核

Rendering with a correct computed mask

運行完全相同的邏輯返回不正確的結果（第一3D ，第二切片視圖）：

Rendering with an incorrect GPU computed mask

Rendering with an incorrect GPU computed mask (sliceview)

這裏出了什麼問題？據我所知，OpenCL不支持位字段，但它支持按位運算符，據我所知，OpenCL規範。我的位邏輯從32位字中選擇正確的位並將其翻轉，是否支持？或者，我的簡單標誌被認爲是一個位域。它的功能是從左邊選擇體素％32位（不是右邊，因此減去）。

另一件事可能是傳遞給我的內核的uint指針與我所期望的不同。我認爲這將是有效的指針使用和數據傳遞給我的內核。應用於內核「uint * word」部分的邏輯是由於每行填充字以及每片填充行。 CPU變量確認指針計算邏輯是有效的。

下面;代碼

  uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels); 
      uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels); 

      int[] dims = new int[3]; 
      dims[0] = volume.Geometry.NumberOfVoxels.X; 
      dims[1] = volume.Geometry.NumberOfVoxels.Y; 
      dims[2] = volume.Geometry.NumberOfVoxels.Z; 

      uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[]; 
      unsafe { 
       fixed(int* dimPtr = dims) { 
        fixed(uint *arrcPtr = arrC) { 
         // pick Cloo Platform 
         ComputePlatform platform = ComputePlatform.Platforms[0]; 

         // create context with all gpu devices 
         ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, 
          new ComputeContextPropertyList(platform), null, IntPtr.Zero); 

         // load opencl source 
         StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl"); 
         string clSource = streamReader.ReadToEnd(); 
         streamReader.Close(); 

         // create program with opencl source 
         ComputeProgram program = new ComputeProgram(context, clSource); 

         // compile opencl source 
         program.Build(null, null, null, IntPtr.Zero); 

         // Create the event wait list. An event list is not really needed for this example but it is important to see how it works. 
         // Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution. 
         // For this reason their use should be avoided if possible. 
         ComputeEventList eventList = new ComputeEventList(); 

         // Create the command queue. This is used to control kernel execution and manage read/write/copy operations. 
         ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None); 

         // Create the kernel function and set its arguments. 
         ComputeKernel kernel = program.CreateKernel("LowerThreshold"); 

         int slicenr = 0; 
         foreach (IntPtr ptr in pinnedSlices) { 
          /*// CPU VARIANT FOR TESTING PURPOSES 
          for (int y = 0; y < dims[1]; y++) { 
           for (int x = 0; x < dims[0]; x++) { 
            long pixelOffset = x + y * dims[0]; 
            ushort* ushortPtr = (ushort*)ptr; 
            ushort pixel = *(ushortPtr + pixelOffset); 

            int BinaryWordShift = 5; 
            int BinaryWordBits = 32; 
            if (
             (0 <= x) && 
             (0 <= y) && 
             (0 <= slicenr) && 
             (x < dims[0]) && 
             (y < dims[1]) && 
             (slicenr < dims[2]) 
            ) { 
             uint* word = 
              arrcPtr + 1 + (slicenr * wordsPerPlane) + 
              (y * wordsPerRow) + 
              (x >> BinaryWordShift); 
             uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f))); 
             //if (pixel > lowerThreshold && pixel < upperThreshold) { 
             if (slicenr < 15) { 
              *word |= mask; 
             } else { 
              *word &= ~mask; 
             } 
            } 
           } 
          }*/ 

          ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
           context, 
           ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, 
           3, 
           new IntPtr(dimPtr)); 

          ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16); 
          ComputeImage2D image2D = new ComputeImage2D(
           context, 
           ComputeMemoryFlags.ReadOnly, 
           format, 
           volume.Geometry.NumberOfVoxels.X, 
           volume.Geometry.NumberOfVoxels.Y, 
           0, 
           ptr 
          ); 

          // The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length). 
          ComputeBuffer<uint> c = new ComputeBuffer<uint>(
           context, ComputeMemoryFlags.WriteOnly, arrC.Length); 

          kernel.SetMemoryArgument(0, image2D); 
          kernel.SetMemoryArgument(1, dimsBuffer); 
          kernel.SetValueArgument(2, wordsPerRow); 
          kernel.SetValueArgument(3, wordsPerPlane); 
          kernel.SetValueArgument(4, slicenr); 
          kernel.SetValueArgument(5, lowerThreshold); 
          kernel.SetValueArgument(6, upperThreshold); 
          kernel.SetMemoryArgument(7, c); 

          // Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command. 
          // If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created. 
          commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList); 

          // Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer 
          // will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete 
          // by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host. 
          // eventList will contain two events after this method returns. 
          commands.ReadFromBuffer(c, ref arrC, false, eventList); 

          // A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands 
          // in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands 
          // to finish has to be issued before "arrC" can be used. 
          // This explicit synchronization can be achieved in two ways: 
          // 1) Wait for the events in the list to finish, 
          //eventList.Wait(); 
          //} 
          // 2) Or simply use 
          commands.Finish(); 

          slicenr++; 
         } 

        } 
       } 
      }

而且我的內核代碼：

const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP | CLK_NORMALIZED_COORDS_FALSE; 
kernel void LowerThreshold(
    read_only image2d_t image, 
    global int* brickSize, 
    uint wordsPerRow, 
    uint wordsPerPlane, 
    int slicenr, 
    int lower, 
    int upper, 
    global write_only uint* c) 
{ 

    int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1); 
    uint4 pixel = read_imageui(image, smp, coord.xy); 

    uchar BinaryWordShift = 5; 
    int BinaryWordBits = 32; 
    if (
      (0 <= coord.x) && 
      (0 <= coord.y) && 
      (0 <= coord.z) && 
      (coord.x < brickSize[0]) && 
      (coord.y < brickSize[1]) && 
      (coord.z < brickSize[2]) 
    ) { 
     global uint* word = 
      c + 1 + (coord.z * wordsPerPlane) + 
      (coord.y * wordsPerRow) + 
      (coord.x >> BinaryWordShift); 

     uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f))); 
     //if (pixel.w > lower && pixel.w < upper) { 
     if (slicenr < 15) { 
      *word |= mask; 
     } else { 
      *word &= ~mask; 
     } 
    } 
}

來源

2013-05-28 bastijn

哪一個是你給的內核？ CPU還是GPU？ –

同一個*字是由多個線程同時寫入的嗎？這會產生危險，可能使用原子操作或使用不同的策略來更新位 –

內核代碼是GPU內核。內核的CPU實現在它上面的代碼部分被註釋掉了。是的相同*字（因此傳入* uint c）由多個線程寫入。我預計底層opencl在全局緩存寫入級別會有某種鎖定等待結構，但情況並非如此？ – bastijn

兩個問題：

您已經聲明「C」作爲「WRITE_ONLY」，但使用「| =」和「& =」運營商，它們是讀取 - 修改 - 寫入
由於其他海報m如果兩個工作項訪問的是同一個單詞，那麼在讀取 - 修改 - 寫入之間會出現競爭條件，從而導致錯誤。原子操作比非原子操作慢得多，所以雖然可能，但不推薦。

我建議使您的輸出變大8倍，並使用字節而不是位。這將使您的輸出只寫，並且還會消除爭用並因此消除競爭條件。或者（如果數據緊湊性或格式很重要）每個工作項一次處理8個元素，並將複合8位輸出寫爲單個字節。這將是隻寫的，沒有爭用，並且仍然具有您的數據緊湊性。

來源

2013-06-02 14:21:04 Dithermaster

選擇這個作爲答案，因爲最後一部分正是我最終實現的。緊湊性確實是必需的，這既是爲了不破壞其他功能以及其內存消耗。謝謝。 – bastijn

GPU內核中的指針和位操作符

回答

相關問題