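CUDA large matrices and blocks/threads

I have some questions about how to handle large matrices. As explained in this other question, I have a program that works on large square matrices (around 5k-10k per side). The computation itself is correct (though not yet 100% optimized); I tested it with smaller square matrices (such as 256-512). Here is my code: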
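#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 10000
#define RADIUS 100
#define SQRADIUS (RADIUS * RADIUS) // parenthesized so the macro expands safely inside larger expressions
#define THREADS 512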
//many of these device functions are declared
__device__ unsigned char avg(const unsigned char *src, const unsigned int row, const unsigned int col) {
    unsigned int sum = 0, c = 0;
    //some work with radius and stuff
    return sum;
}
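__global__ void applyAvg(const unsigned char *src, unsigned char *dest) {
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x, tmp = 0;
    unsigned int stride = blockDim.x * gridDim.x; // total number of threads in the grid
    int col = tid % N, row = (int)tid / N;
    // grid-stride loop: each thread handles pixels tid, tid + stride, tid + 2 * stride, ...
    while(tid < N * N) {
        if(row < N && col < N) { // both coordinates must lie inside the matrix
            //choose which of the __device__ functions needs to be launched
        }
        tid += stride;
        col = tid % N, row = (int)tid / N;
    }
    // no __syncthreads() here: the kernel uses no shared memory, so there is nothing to synchronize
}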
int main(void) {
    cudaError_t err;
    unsigned char *base, *thresh, *d_base, *d_thresh, *avg, *d_avg;
    int i, j;
    base = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    thresh = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    avg = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    err = cudaMalloc((void**)&d_base, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 1"); exit(-1);}
    err = cudaMalloc((void**)&d_thresh, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 2"); exit(-1);}
    err = cudaMalloc((void**)&d_avg, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 3"); exit(-1);}
    for(i = 0; i < N * N; i++) {
        base[i] = (unsigned char)(rand() % 256);
    }
    err = cudaMemcpy(d_base, base, (N * N) * sizeof(unsigned char), cudaMemcpyHostToDevice);
    if(err != cudaSuccess){printf("ERROR 4"); exit(-1);}
    //more 'light' stuff to do before the 'heavy computation'
    // (N + THREADS - 1)/THREADS rounds N/THREADS up: 20 blocks of 512 threads for N = 10000
    applyAvg<<<(N + THREADS - 1)/THREADS, THREADS>>>(d_thresh, d_avg);
    err = cudaMemcpy(thresh, d_thresh, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) {printf("ERROR 5"); exit(-1);}
    err = cudaMemcpy(avg, d_avg, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) {printf("ERROR 6"); exit(-1);}
    getchar();
    // release device and host buffers
    cudaFree(d_base); cudaFree(d_thresh); cudaFree(d_avg);
    free(base); free(thresh); free(avg);
    return 0;
}
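When I run it on a large matrix (such as 10000 x 10000) with a radius of 100 (that is, how "far" I look around each point of the matrix), it takes a very long time.
I believe the problem lies both in applyAvg<<<(N + THREADS - 1)/THREADS, THREADS>>> (how I decide the number of blocks and threads to run) and in the applyAvg(...) method (the stride and tid). Can someone clarify the best way to decide how many blocks/threads to launch, given that the matrix size can vary from 5k to 10k?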
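Thank you for your advice. Yes, it is a kind of filtering. I really messed up how blocks/threads work... Say I have a 10000 x 10000 matrix and every pixel needs to be filtered: what is the best way to launch my kernel? Right now I am launching about 20 blocks of 512 threads each, right? Is there a better way to make use of the 65k blocks? Or a more performant way to use the threads? I am quite new to CUDA and everything seems so confusing xD – n0n4m3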
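The figures in the comment above can be checked with a little host-side arithmetic. The following is only a sketch: N, THREADS and RADIUS are copied from the question's #defines, while ceilDiv, printLaunchSummary and the other names are hypothetical helpers introduced here. The window size assumes the filter reads a full (2 * RADIUS + 1) square around each pixel, which the question does not actually show.

```cuda
#include <cstdio>

// Values copied from the question's #defines.
constexpr long long N = 10000;
constexpr long long THREADS = 512;
constexpr long long RADIUS = 100;

// Integer ceiling division, the same idiom as (N + THREADS - 1) / THREADS.
constexpr long long ceilDiv(long long a, long long b) { return (a + b - 1) / b; }

// Grid requested by the question's launch: sized from N, not from N * N.
constexpr long long blocksLaunched = ceilDiv(N, THREADS);
constexpr long long threadsLaunched = blocksLaunched * THREADS;

// Pixels each thread must visit in the grid-stride loop to cover N * N.
constexpr long long pixelsPerThread = ceilDiv(N * N, threadsLaunched);

// Blocks a 1D launch would need for one thread per pixel.
constexpr long long blocksOnePerPixel = ceilDiv(N * N, THREADS);

// ASSUMPTION: the filter reads a full square window around each pixel.
constexpr long long windowSize = (2 * RADIUS + 1) * (2 * RADIUS + 1);

// Hypothetical helper that prints the summary.
void printLaunchSummary() {
    printf("blocks launched:        %lld\n", blocksLaunched);
    printf("threads launched:       %lld\n", threadsLaunched);
    printf("pixels per thread:      %lld\n", pixelsPerThread);
    printf("1D blocks, 1 per pixel: %lld\n", blocksOnePerPixel);
    printf("reads per pixel:        %lld\n", windowSize);
}
```

So the launch is indeed 20 blocks of 512 threads (10240 threads in total), and each thread loops over roughly 9766 pixels. Under the stated window assumption, every pixel also costs about 40401 reads, so a long run time is expected regardless of the launch configuration. Covering every pixel with a 1D grid would take 195313 blocks, which is more than the 65535 per-dimension limit of older GPUs that the comment seems to refer to.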
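For a 2D task, I suggest you use 16x16 threads per block and 625x625 blocks per grid. There is no need to maximize the number of blocks. See here for the 1D case: http://stackoverflow.com/questions/19422993/how-to-chose-value-of-block-and-thread-in-cuda/19423751#19423751 – kangshiyin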
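A minimal sketch of the 2D configuration suggested above, for illustration only. The kernel applyPerPixel, the helpers gridFor and launchPerPixel, and the TILE macro are hypothetical names that do not come from the question, and the kernel body just copies the pixel as a stand-in for the real filter, which the question does not show.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // 16 x 16 = 256 threads per block, as suggested in the comment

// Hypothetical per-pixel kernel: one thread per matrix element.
__global__ void applyPerPixel(const unsigned char *src, unsigned char *dest, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // Guard against the partial tiles at the right and bottom edges
    // when n is not a multiple of TILE.
    if (row < n && col < n) {
        size_t idx = (size_t)row * n + col;
        dest[idx] = src[idx];  // placeholder for the real filter
    }
}

// Number of TILE x TILE blocks needed to cover an n x n matrix.
dim3 gridFor(int n) {
    unsigned int blocks = (n + TILE - 1) / TILE;  // round up
    return dim3(blocks, blocks);
}

// Launches the kernel on device buffers and reports any launch error.
cudaError_t launchPerPixel(const unsigned char *d_src, unsigned char *d_dest, int n) {
    dim3 block(TILE, TILE);
    applyPerPixel<<<gridFor(n), block>>>(d_src, d_dest, n);
    return cudaGetLastError();
}
```

With the question's buffers this would be called as launchPerPixel(d_thresh, d_avg, N). For N = 10000 the grid comes out at 625 x 625 blocks, matching the comment, and because the size is a run-time parameter the same code would also cover the 5k case (313 x 313 blocks, with the bounds check discarding the extra threads).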
Thanks again, you have been really helpful. I am going to read the paper you linked... you deserve the green (accepted) answer ;) – n0n4m3