CUDA programming: histogram

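I want to run a CUDA program, but I am a beginner. I have to write a histogram program, but with buckets. Depending on maxValue (40 in the example), each number is added to the corresponding bucket. If we have 4 buckets:

histo: | 1 | 10 | 30 | 39 | 32 | 2 | 4 | 5 | 1 |

0-9 (first bucket)

10-19 (second bucket)

20-29 (third bucket)

30-39 (fourth bucket)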
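The bucketing rule above can be sketched on the host as a serial reference, which is also handy for checking a GPU result later. This is only a minimal sketch: the names `bucketOf` and `serialHistogram` are hypothetical, and it assumes every input value is smaller than `maxValue`.

```cpp
#include <vector>

// Hypothetical helper: maps a value in [0, maxValue) to one of `buckets`
// equal-width ranges. The width is rounded up so that the largest value
// still lands in the last bucket when maxValue is not a multiple of buckets.
inline int bucketOf(int value, int maxValue, int buckets) {
    int width = (maxValue + buckets - 1) / buckets;  // maxValue 40, 4 buckets: width 10
    return value / width;
}

// Serial reference histogram that applies the same bucketing rule.
// Assumes every element of `data` is smaller than maxValue.
inline std::vector<unsigned int> serialHistogram(const std::vector<unsigned char>& data,
                                                 int maxValue, int buckets) {
    std::vector<unsigned int> histo(buckets, 0u);
    for (unsigned char v : data) {
        histo[bucketOf(v, maxValue, buckets)] += 1;
    }
    return histo;
}
```

For the nine example values with maxValue 40 and 4 buckets, this counts 5 values in the first bucket, 1 in the second, 0 in the third and 3 in the fourth.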

My GPU has compute capability 1.1.

I tried doing it with a shared temp[] per block, where each thread adds its values to the block's temporary table:

__global__ void histo_kernel_optimized5(unsigned char *buffer, long size, 
           unsigned int *histo) 
{ 
    extern __shared__ unsigned int temp[]; 
    temp[threadIdx.x] = 0; 
    __syncthreads(); 

    int i = threadIdx.x + blockIdx.x * blockDim.x; 
    int offset = blockDim.x * gridDim.x; 
    int bucketID; 
    while (i < size) 
    { 
       bucketID = buffer[i] / Bwidth;  // read from the kernel's buffer parameter; Bwidth (bucket width) is defined elsewhere
       atomicAdd(&temp[bucketID], 1); 
       i += offset; 
    } 
    __syncthreads(); 


    atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]); 
} 

histo_kernel_optimized5<<<array_size / buckets, buckets, buckets * sizeof(unsigned int)>>>(buffer, SIZE, histogram);

But the compiler says: Instruction '{atom,red}.shared' requires .target sm_12 or higher

I also tried creating a temporary table for each thread:

__global__ void histo_kernel_optimized5(unsigned char *buffer, long size, 
           unsigned int *histo) 
{ 
    unsigned int temp[buckets]; 
    int j; 
    for (j=0;j<buckets;j++){ 
     temp[j]=0; 
    } 

    int bucketID; 

    int i = threadIdx.x + blockIdx.x * blockDim.x; 
    int offset = blockDim.x * gridDim.x; 
    while (i < size) 
    { 
     bucketID = buffer[i] / Bwidth;  // read from the kernel's buffer parameter
     temp[bucketID]++; 
     i += offset; 
    } 


    for (j=0;j<buckets;j++){ 
     histo[j] += temp[j];  
    } 
}

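But the compiler does not let me do that, because it needs a constant to size the temp table. The problem is that the number of buckets is dynamic and is given on the command line.

Is there another way to do this? I do not know how to do it. I am confused.

Source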

2013-04-03 Andreas Lympouras

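When using atomics, launching fewer blocks will reduce contention (and hence improve performance), because there are fewer blocks to coordinate between. Launch fewer blocks and have each block loop over more of the input elements.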

for (unsigned tid = blockIdx.x*blockDim.x+threadIdx.x; 
       tid < size; tid += gridDim.x*blockDim.x) { 
    unsigned char value = array[tid]; // borrowing notation from another answer here 
    int bin = value % buckets; 
    atomicAdd(&histo[bin],1); 
}

Source

2013-04-04 16:06:58 jgmalcolm

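Histograms are really easy to implement using atomic operations. I don't know why you are writing such a complex kernel. The motivation for parallelizing the operation is to exploit the parallel nature of the algorithm. There is no need to iterate over the whole histogram inside the kernel. Here is a sample CUDA kernel and wrapper function that compute the histogram of an array with a specified number of bins. I don't think it can be optimized further for compute 1.1 devices. But for compute 1.2, shared memory can be used.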

__global__ void kernel_getHist(unsigned char* array, long size, unsigned int* histo, int buckets) 
{ 
    int tid = blockIdx.x * blockDim.x + threadIdx.x; 

    if(tid>=size) return; 

    unsigned char value = array[tid]; 

    int bin = value % buckets; 

    atomicAdd(&histo[bin],1); 
} 

void getHist(unsigned char* array, long size, unsigned int* histo,int buckets) 
{ 
    unsigned char* dArray; 
    cudaMalloc(&dArray,size); 
    cudaMemcpy(dArray,array,size,cudaMemcpyHostToDevice); 

    unsigned int* dHist; 
    cudaMalloc(&dHist,buckets * sizeof(int)); 
    cudaMemset(dHist,0,buckets * sizeof(int)); 

    dim3 block(32); 
    dim3 grid((size + block.x - 1)/block.x); 

    kernel_getHist<<<grid,block>>>(dArray,size,dHist,buckets); 

    cudaMemcpy(histo,dHist,buckets * sizeof(int),cudaMemcpyDeviceToHost); 

    cudaFree(dArray); 
    cudaFree(dHist); 
}

Source

2013-04-03 11:08:06 sgarizvi

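Why do you add a block size to size: grid((size + block.x - 1)/block.x);

So that the total number of threads is **at least** equal to 'size'. This formula rounds the total number of threads up to a multiple of the block size that is greater than or equal to 'size'. Pick a value for 'size' and calculate the total number of threads to see for yourself. – sgarizvi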
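The rounding in that formula can be checked on the host with a small sketch. The function name `blocksFor` is hypothetical; the arithmetic is the same ceiling division as in grid((size + block.x - 1)/block.x).

```cpp
// Hypothetical helper: number of blocks such that blocks * blockSize >= size.
// Adding (blockSize - 1) before the integer division rounds the quotient up.
inline long blocksFor(long size, long blockSize) {
    return (size + blockSize - 1) / blockSize;
}
```

With size = 100 and a block size of 32, this gives 4 blocks (128 threads), whereas plain 100 / 32 would give 3 blocks (96 threads) and leave the last 4 elements unprocessed. The surplus threads are the ones the kernel discards with if(tid>=size) return;.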

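Yes, you are right! But I don't get the same results: the parallel histogram[] and the serial histogram[] are different! I don't know why, I used your code as it is!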
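One possible cause of such a mismatch, offered here as an assumption because the serial code is not shown in the thread: the kernel in the answer picks the bin with value % buckets, while the question describes contiguous ranges (array[i]/Bwidth). The two rules assign most values to different bins. A minimal sketch of the difference, where `moduloBin`, `rangeBin` and `bucketWidth` are hypothetical names:

```cpp
// Bin chosen by the kernel in the answer: value % buckets.
inline int moduloBin(unsigned char value, int buckets) {
    return value % buckets;
}

// Bin described in the question: contiguous ranges of width bucketWidth
// (bucketWidth is a hypothetical stand-in for the question's Bwidth).
inline int rangeBin(unsigned char value, int bucketWidth) {
    return value / bucketWidth;
}
```

With 4 buckets of width 10, the value 10 goes to bin 2 under the modulo rule but to bin 1 (the 10-19 range) under the range rule. If the serial version uses range bucketing, passing the bucket width to the kernel and dividing by it instead of taking the modulo should make the two results comparable.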

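A solution for devices without atomic operations, which also shows an approach to minimizing on-chip memory collisions by subdividing the work into warps, was proposed by Podlozhnyuk in "Histogram calculation in CUDA".

The code is in CUDASamples\3_Imaging\histogram (from the CUDA Samples).

Source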

2016-05-29 16:25:54

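From the [help center](http://stackoverflow.com/help/how-to-answer): Links to external resources are encouraged, but please add context around the link so your fellow users will have some idea what it is and why it is there. Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline. – Adam

Context: a solution is provided for devices without atomic operations, and it shows an approach that minimizes on-chip memory collisions by subdividing into warps.

It was not my downvote. Actually, my comment has 2 votes, so I guess it was useful. – Adam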
