CUDA large matrices and blocks/threads

I have some questions about how to handle large matrices. As explained in this other question, I have a program that works on large matrices (say 5k-10k per side). The computation part is correct (though still not 100% optimized), and I have tested it with smaller square matrices (like 256-512). Here is my code:

#include <stdio.h>
#include <stdlib.h>

#define N 10000 
#define RADIUS 100 
#define SQRADIUS RADIUS*RADIUS 
#define THREADS 512 

//many of these device functions are declared 
__device__ unsigned char avg(const unsigned char *src, const unsigned int row, const unsigned int col) { 
    unsigned int sum = 0, c = 0; 

    //some work with radius and stuff 

    return sum; 
} 

__global__ void applyAvg(const unsigned char *src, unsigned char *dest) { 
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x, tmp = 0; 
    unsigned int stride = blockDim.x * gridDim.x; 
    int col = tid%N, row = (int)tid/N; 

    while(tid < N*N) { 
     if(row * col < N * N) { 
      //choose which of the __device__ functions needs to be launched 
     } 

     tid += stride; 
     col = tid%N, row = (int)tid/N; 
    } 
    __syncthreads(); 
} 

int main(void) { 
    cudaError_t err; 
    unsigned char *base, *thresh, *d_base, *d_thresh, *avg, *d_avg; 
    int i, j; 

    base = (unsigned char*)malloc((N * N) * sizeof(unsigned char)); 
    thresh = (unsigned char*)malloc((N * N) * sizeof(unsigned char)); 
    avg = (unsigned char*)malloc((N * N) * sizeof(unsigned char)); 

    err = cudaMalloc((void**)&d_base, (N * N) * sizeof(unsigned char)); 
    if(err != cudaSuccess) {printf("ERROR 1"); exit(-1);} 
    err = cudaMalloc((void**)&d_thresh, (N * N) * sizeof(unsigned char)); 
    if(err != cudaSuccess) {printf("ERROR 2"); exit(-1);} 
    err = cudaMalloc((void**)&d_avg, (N * N) * sizeof(unsigned char)); 
    if(err != cudaSuccess) {printf("ERROR 3"); exit(-1);} 

    for(i = 0; i < N * N; i++) { 
     base[i] = (unsigned char)(rand() % 256); 
    } 

    err = cudaMemcpy(d_base, base, (N * N) * sizeof(unsigned char), cudaMemcpyHostToDevice); 
    if(err != cudaSuccess){printf("ERROR 4"); exit(-1);} 

    //more 'light' stuff to do before the 'heavy computation' 

    applyAvg<<<(N + THREADS - 1)/THREADS, THREADS>>>(d_thresh, d_avg); 

    err = cudaMemcpy(thresh, d_thresh, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost); 
    if(err != cudaSuccess) {printf("ERROR 5"); exit(-1);} 
    err = cudaMemcpy(avg, d_avg, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost); 
    if(err != cudaSuccess) {printf("ERROR 6"); exit(-1);} 

    getchar(); 
    return 0; 
} 

The problem is that when I launch it with a large matrix (like 10000x10000) and a radius of 100 (which is how "far" from each point of the matrix I look), it takes an enormous amount of time.

I believe the problem lies in applyAvg<<<(N + THREADS - 1)/THREADS, THREADS>>> (where I decide how many blocks and threads to run) and in the applyAvg(...) method itself (the stride and tid). Can someone clarify what the best way is to decide how many blocks/threads to launch, given that the matrix can vary from 5k to 10k per side?

Answer


I think what you want to do is image filtering/convolution. With your current CUDA kernel, there are two things you can do to improve performance.

  1. Use 2-D threads/blocks to avoid the / and % operators; they are very slow (see the sketch right after this list).

  2. Use shared memory to reduce global memory bandwidth (a stripped-down sketch follows the white-paper link below).
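
For point 1, here is a minimal, hedged sketch (not your actual filter; the averaging loop stands in for your own __device__ functions) of a 2-D launch, where each thread derives its (row, col) directly from blockIdx/threadIdx, so the tid % N and tid / N of the 1-D version disappear:

// N and RADIUS as #defined in the question above.
// Each thread handles exactly one output pixel, addressed in 2-D.
__global__ void applyAvg2D(const unsigned char *src, unsigned char *dest)
{
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < N && col < N) {
        unsigned int sum = 0, count = 0;
        // Clamp the averaging window at the matrix borders.
        int r0 = max((int)row - RADIUS, 0), r1 = min((int)row + RADIUS, N - 1);
        int c0 = max((int)col - RADIUS, 0), c1 = min((int)col + RADIUS, N - 1);
        for (int r = r0; r <= r1; ++r)
            for (int c = c0; c <= c1; ++c) {
                sum += src[r * N + c];
                ++count;
            }
        dest[row * N + col] = (unsigned char)(sum / count);
    }
}

// Launch with a 2-D block and a 2-D grid that covers the whole N x N matrix:
//   dim3 block(16, 16);
//   dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
//   applyAvg2D<<<grid, block>>>(d_thresh, d_avg);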

Here is a white paper on image convolution. It shows how to implement a high-performance box filter with CUDA.

http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
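
For point 2, the shared-memory idea from that paper, stripped down to a single row pass of a separable filter, looks roughly like the sketch below. This is only illustrative, not your filter: the names rowPass, TILE_W and the small SMALL_RADIUS are mine, shared memory limits this tiling approach to modest radii, and the real sample also handles the column pass and borders properly.

#define TILE_W 256
#define SMALL_RADIUS 8   // shared memory limits this approach to modest radii

__global__ void rowPass(const unsigned char *src, unsigned char *dst, int width, int height)
{
    // Each block stages its row tile plus a halo of SMALL_RADIUS pixels on each side
    // once, then every thread averages from fast shared memory instead of re-reading
    // global memory (2 * SMALL_RADIUS + 1) times per pixel.
    __shared__ unsigned char tile[TILE_W + 2 * SMALL_RADIUS];

    int row = blockIdx.y;
    int col = blockIdx.x * TILE_W + threadIdx.x;

    // Stage the tile and its left/right halo, clamping at the image borders.
    int haloCol = min(max(col - SMALL_RADIUS, 0), width - 1);
    tile[threadIdx.x] = src[row * width + haloCol];
    if (threadIdx.x < 2 * SMALL_RADIUS) {
        int rightCol = min(col + TILE_W - SMALL_RADIUS, width - 1);
        tile[threadIdx.x + TILE_W] = src[row * width + rightCol];
    }
    __syncthreads();

    if (col < width) {
        unsigned int sum = 0;
        for (int k = 0; k <= 2 * SMALL_RADIUS; ++k)
            sum += tile[threadIdx.x + k];
        dst[row * width + col] = (unsigned char)(sum / (2 * SMALL_RADIUS + 1));
    }
}

// Launch: one thread per tile pixel in x, one block row per image row, e.g.
//   dim3 block(TILE_W, 1);
//   dim3 grid((width + TILE_W - 1) / TILE_W, height);
//   rowPass<<<grid, block>>>(d_src, d_tmp, width, height);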

Nvidia's cuNPP library also provides the function nppiFilterBox(), so you don't need to write your own kernel. Here are the documentation and an example.

http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp

NPP documentation, p. 1009: http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf
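
If you go the NPP route, a hedged sketch of the call sequence might look like the following. I am assuming the single-channel 8-bit variant nppiFilterBox_8u_C1R and the image allocator nppiMalloc_8u_C1 here; check the documentation above for the exact signatures shipped with your CUDA version.

#include <cuda_runtime.h>
#include <npp.h>
#include <stdio.h>

// Box-filter an n x n 8-bit image on the GPU with NPP (sketch, error handling abbreviated).
int boxFilterNpp(const unsigned char *h_src, unsigned char *h_dst, int n, int radius)
{
    int srcStep = 0, dstStep = 0;
    // nppiMalloc pads each row and returns the pitch (in bytes) through the last argument.
    Npp8u *d_src = nppiMalloc_8u_C1(n, n, &srcStep);
    Npp8u *d_dst = nppiMalloc_8u_C1(n, n, &dstStep);
    if (!d_src || !d_dst) { printf("nppiMalloc failed\n"); return -1; }

    cudaMemcpy2D(d_src, srcStep, h_src, n, n, n, cudaMemcpyHostToDevice);

    // Restrict the ROI to the interior so the (2*radius+1)^2 mask never reads outside the image.
    NppiSize  roi    = { n - 2 * radius, n - 2 * radius };
    NppiSize  mask   = { 2 * radius + 1, 2 * radius + 1 };
    NppiPoint anchor = { radius, radius };   // centre the mask on each output pixel
    NppStatus st = nppiFilterBox_8u_C1R(d_src + radius * srcStep + radius, srcStep,
                                        d_dst + radius * dstStep + radius, dstStep,
                                        roi, mask, anchor);
    if (st != NPP_SUCCESS) { printf("nppiFilterBox failed: %d\n", (int)st); return -1; }

    cudaMemcpy2D(h_dst, n, d_dst, dstStep, n, n, cudaMemcpyDeviceToHost);
    nppiFree(d_src);
    nppiFree(d_dst);
    return 0;
}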


Thanks for the advice. Yes, it is a kind of filtering. I have really misunderstood how blocks/threads work... Say I have a 10000 x 10000 matrix and every pixel needs to be filtered: what is the best way to launch my kernel? At the moment I'm launching about 20 blocks with 512 threads each, right? Is there a better way to use the 65k blocks, or a more performant use of the threads? I'm very new to CUDA and everything still looks so confusing xD – n0n4m3


For a 2-D task I suggest you use 16x16 threads/block and 625x625 blocks/grid. There is no need to maximize the number of blocks. See here for the 1-D case: http://stackoverflow.com/questions/19422993/how-to-chose-value-of-block-and-thread-in-cuda/19423751#19423751 – kangshiyin
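
A sketch of that suggested configuration for N = 10000 (16 * 625 = 10000, so the grid covers the matrix exactly), assuming a 2-D kernel like the applyAvg2D sketched in the answer:

    dim3 block(16, 16);   // 256 threads per block
    dim3 grid(625, 625);  // or ((N + 15) / 16, (N + 15) / 16) for a general N
    applyAvg2D<<<grid, block>>>(d_thresh, d_avg);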


Thanks again, you have been really helpful. I'm going to read the paper you linked me to... You deserve the green check for this answer ;) – n0n4m3