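CUDA image average filter

The average filter is a windowed filter of the linear class that smooths a signal (image). It works as a low-pass filter. The basic idea behind it is that every element of the signal (image) is replaced by the average taken over its neighborhood.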

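If we have an m x n matrix and want to apply an average filter of size k to it, then the value of each point p:(i,j) in the matrix becomes the average of all the points in the square around it.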

Square Kernel

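This figure shows the square kernel for a filter of size 2. The yellow box is the pixel to be averaged, and the whole grid is the square of its neighboring pixels; the pixel's new value will be the average of them.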

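The problem is that this algorithm is very slow, especially on large images, so I thought about using GPGPU.

The question now is: how can this be implemented in CUDA, if it is possible at all?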

2013-01-15 Sameh Kamal

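Hi @SamehKamal, I'm just curious. How much faster was the CUDA code compared with the conventional code in your results?

That was a long time ago and I don't remember the speedup factor for this particular algorithm, but for the algorithms I have been working with the speedup varied from one algorithm to another, from x7 to x22.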

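This is a classic case of an embarrassingly parallel image processing problem that maps very easily onto the CUDA framework. In image processing, the averaging filter is known as the Box Filter.

The easiest approach is to use CUDA textures for the filtering, because textures handle the boundary conditions very easily.

Assume you have the source and destination pointers allocated on the host. The procedure would be something like this.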

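Allocate enough memory on the device to hold the source and destination images.
Copy the source image from the host to the device.
Bind the source image device pointer to a texture.
Specify an appropriate block size and a grid large enough to cover every pixel of the image.
Launch the filtering kernel with the specified grid and block size.
Copy the results back to the host.
Unbind the texture.
Free the device pointers.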

Sample Implementation Of Box Filter

Kernel

texture<unsigned char, cudaTextureType2D> tex8u; 

//Box Filter Kernel For Gray scale image with 8bit depth 
__global__ void box_filter_kernel_8u_c1(unsigned char* output,const int width, const int height, const size_t pitch, const int fWidth, const int fHeight) 
{ 
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x; 
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y; 

    const int filter_offset_x = fWidth/2; 
    const int filter_offset_y = fHeight/2; 

    float output_value = 0.0f; 

    //Make sure the current thread is inside the image bounds 
    if(xIndex<width && yIndex<height)
    {
        //Sum the window pixels
        for(int i= -filter_offset_x; i<=filter_offset_x; i++)
        {
            for(int j=-filter_offset_y; j<=filter_offset_y; j++)
            {
                //No need to worry about Out-Of-Range access. tex2D automatically handles it.
                output_value += tex2D(tex8u,xIndex + i,yIndex + j);
            }
        }

        //Average the output value.
        //Divide by the number of samples actually summed.
        //This equals fWidth * fHeight for odd filter sizes.
        output_value /= ((2 * filter_offset_x + 1) * (2 * filter_offset_y + 1));

        //Write the averaged value to the output.
        //Transform 2D index to 1D index, because image is actually in linear memory
        int index = yIndex * pitch + xIndex;

        output[index] = static_cast<unsigned char>(output_value);
    }
}

Wrapper Function:

void box_filter_8u_c1(unsigned char* CPUinput, unsigned char* CPUoutput, const int width, const int height, const int widthStep, const int filterWidth, const int filterHeight) 
{ 

    /* 
    * 2D memory is allocated as strided linear memory on GPU. 
    * The terminologies "Pitch", "WidthStep", and "Stride" are exactly the same thing. 
    * It is the size of a row in bytes. 
    * It is not necessary that width = widthStep. 
    * Total bytes occupied by the image = widthStep x height. 
    */ 

    //Declare GPU pointer 
    unsigned char *GPU_input, *GPU_output; 

    //Allocate 2D memory on GPU. Also known as Pitch Linear Memory 
    size_t gpu_image_pitch = 0; 
    cudaMallocPitch<unsigned char>(&GPU_input,&gpu_image_pitch,width,height); 
    cudaMallocPitch<unsigned char>(&GPU_output,&gpu_image_pitch,width,height); 

    //Copy data from host to device. 
    cudaMemcpy2D(GPU_input,gpu_image_pitch,CPUinput,widthStep,width,height,cudaMemcpyHostToDevice); 

    //Bind the image to the texture. Now the kernel will read the input image through the texture cache. 
    //Use tex2D function to read the image 
    cudaBindTexture2D(NULL,tex8u,GPU_input,width,height,gpu_image_pitch); 

    /* 
    * Set the behavior of tex2D for out-of-range image reads. 
    * cudaAddressModeBorder = Read Zero 
    * cudaAddressModeClamp = Read the nearest border pixel 
    * We can skip this step. The default mode is Clamp. 
    */ 
    tex8u.addressMode[0] = tex8u.addressMode[1] = cudaAddressModeBorder; 

    /* 
    * Specify a block size. 256 threads per block are sufficient. 
    * It can be increased, but keep in mind the limitations of the GPU. 
    * Older GPUs allow maximum 512 threads per block. 
    * Current GPUs allow maximum 1024 threads per block 
    */ 

    dim3 block_size(16,16); 

    /* 
    * Specify the grid size for the GPU. 
    * Make it generalized, so that the size of grid changes according to the input image size 
    */ 

    dim3 grid_size; 
    grid_size.x = (width + block_size.x - 1)/block_size.x; /*< Greater than or equal to image width */ 
    grid_size.y = (height + block_size.y - 1)/block_size.y; /*< Greater than or equal to image height */ 

    //Launch the kernel 
    box_filter_kernel_8u_c1<<<grid_size,block_size>>>(GPU_output,width,height,gpu_image_pitch,filterWidth,filterHeight); 

    //Copy the results back to CPU 
    cudaMemcpy2D(CPUoutput,widthStep,GPU_output,gpu_image_pitch,width,height,cudaMemcpyDeviceToHost); 

    //Release the texture 
    cudaUnbindTexture(tex8u); 

    //Free GPU memory 
    cudaFree(GPU_input); 
    cudaFree(GPU_output); 
}
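
The grid-size and pitch arithmetic used in the wrapper and kernel can be checked on the host. The sketch below restates those two formulas in plain C++; the helper names `blocks_to_cover` and `pitched_index` are hypothetical, and the pitch of 512 bytes in the usage note is an assumed value, not one returned by `cudaMallocPitch`.

```cpp
#include <cstddef>

// Number of blocks needed to cover `extent` pixels with `block` threads each.
// This is the same ceiling division used for grid_size.x and grid_size.y.
constexpr unsigned blocks_to_cover(unsigned extent, unsigned block)
{
    return (extent + block - 1) / block;
}

// Byte offset of pixel (x, y) in pitched, 8-bit, single-channel memory.
// `pitch` is the row size in bytes and may be larger than the image width.
constexpr std::size_t pitched_index(std::size_t x, std::size_t y, std::size_t pitch)
{
    return y * pitch + x;
}
```

For a 1000 x 600 image with 16 x 16 blocks this gives a 63 x 38 grid, so some threads in the last row and column of blocks fall outside the image; that is why the kernel checks `xIndex<width && yIndex<height`. With an assumed pitch of 512 bytes, pixel (10, 2) sits at byte offset 2 * 512 + 10 = 1034.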

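The good news is that you don't have to implement the filter yourself. The CUDA Toolkit comes with a free signal and image processing library made by NVIDIA, named NVIDIA Performance Primitives, aka NPP. NPP uses CUDA-enabled GPUs to accelerate processing. The averaging filter is already implemented in NPP. At the time of writing, the current version of NPP (5.0) supports 8-bit, 1-channel and 4-channel images. The functions are:

nppiFilterBox_8u_C1R for 1-channel images.
nppiFilterBox_8u_C4R for 4-channel images.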

2013-01-15 10:50:59 sgarizvi

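Your answer seems very good, but I don't really follow what you are describing there, because I mainly program in MATLAB, although I have a good knowledge of C programming. What I need is help with the code. I think the kernel function prototype would be '__global__ void ApplyAverageFilter(int **Image, int **Result, int filterSize);'.

Oh. I have updated my answer and added a link to a CUDA kernel for box filtering. But you have to learn CUDA first to use it. Otherwise, NPP is the better option if you don't have much CUDA background. – sgarizvi

I think your answer is sufficient for the question now, thanks a lot :)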

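A few basic ideas/steps:

Copy the image data from the CPU to the GPU.
Call a kernel to build the average of each row (horizontal) and store it in shared memory.
Call a kernel to build the average of each column (vertical) and store it in global memory.
Copy the data back to CPU memory.

You should be able to scale this pretty easily with 2D memory and multidimensional kernel calls.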

2013-01-15 09:26:17 Michael

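If the size of the filter is normal and not humongous, the average filter is a very good case for implementing with CUDA. You can set this up using square blocks, where every thread of the block is responsible for calculating the value of one pixel by summing and averaging its neighbors.

If the image is stored in global memory it can be programmed easily, but you will get a lot of bank conflicts. One possible optimization is to load blocks of the image into the block's shared memory. Using phantom elements (so that you don't exceed the shared block's dimensions when looking up neighboring pixels), you can calculate the average of the pixels within a block.

The only thing you have to be careful about is how the "stitching" is done at the end, because the shared memory blocks will overlap (due to the extra "padding" pixels) and you don't want to calculate their values twice.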

2013-01-15 09:27:20
