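Incorrect results with CUDA matrix multiplication
Let me start by apologizing for this post. I know several other posts ask the same question, but I have tried the solutions given there and I still cannot get correct results from my CUDA matrix multiplication.
From the examples I followed, I am fairly sure the algorithm in my kernel is correct. I don't believe there is a problem with how the 2D arrays are passed to the kernel, and since they are passed by reference, I would expect the 2D result array to contain the correct answer when it is printed on the host, but it does not.
Could this be a problem with my dim3 dimGrid(B,B) and dim3 dimThreads(T,T) variables? I am new to the CUDA framework and still trying to wrap my head around it. Any suggestions would be greatly appreciated. My code is below: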
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void MatMultiply (int *a, int *b, int *c, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int val = 0;
    for (int e = 0; e < N; ++e) {
        val += a[row*N + e] * b[e*N + col];
    }
    c[row*N+col] = val;
}

int main(void) {
    int N, B, T;

    printf("Input integer for matrix dimension size: ");
    scanf("%d", &N);
    printf("Input number of threads in a block: ");
    scanf("%d", &T);
    printf("Input number of blocks in a grid: ");
    scanf("%d", &B);

    int size = N * N * sizeof(int);
    int *a, *b, *c;
    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i*N+j] = j + i*N;
            b[i*N+j] = j + i*N;
            c[i*N+j] = j + i*N;
        }
    }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

    dim3 dimGrid(B, B);
    dim3 dimThreads(T, T);
    MatMultiply<<<B, T>>>(dev_a,dev_b,dev_c, N);

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%d\t", b[i*N + j]);
        }
        printf("\n");
    }

    free(a);
    free(b);
    free(c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
Thanks again.
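On the dimGrid/dimThreads question: the code declares both variables but launches with `<<<B, T>>>`. Plain integers there give a one-dimensional grid and block, so `blockIdx.y` and `threadIdx.y` are always 0 and `row` never leaves 0. Every row of `c` after the first therefore keeps the values copied in from the host. Below is a minimal sketch of a launch that passes the `dim3` values instead. The names `MatMultiplyGuarded` and `launchMatMultiply` are hypothetical, and sizing the grid from N and T (rather than reading B from the user) is an assumption made here, not something the post prescribes.

```cuda
#include <cuda_runtime.h>

// Same algorithm as the kernel in the question, plus a bounds guard so that
// threads that fall outside the N x N matrix do nothing.
__global__ void MatMultiplyGuarded(const int *a, const int *b, int *c, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    int val = 0;
    for (int e = 0; e < N; ++e) {
        val += a[row * N + e] * b[e * N + col];
    }
    c[row * N + col] = val;
}

// Hypothetical helper: launches T x T threads per block over a 2D grid that
// is just large enough to cover the whole N x N matrix.
cudaError_t launchMatMultiply(const int *dev_a, const int *dev_b, int *dev_c,
                              int N, int T) {
    dim3 dimThreads(T, T);
    int blocksPerSide = (N + T - 1) / T;  // ceil(N / T)
    dim3 dimGrid(blocksPerSide, blocksPerSide);

    // Pass the dim3 variables, not the raw integers.
    MatMultiplyGuarded<<<dimGrid, dimThreads>>>(dev_a, dev_b, dev_c, N);

    // Reports launch failures, for example when T * T exceeds the device's
    // per-block thread limit.
    return cudaGetLastError();
}
```

In the question's `main`, this would stand in for the kernel launch line, e.g. `launchMatMultiply(dev_a, dev_b, dev_c, N, T);`. Keeping a user-supplied `dim3 dimGrid(B, B)` should also work with the guarded kernel, as long as B * T is at least N.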
Also, at the end you print out matrix 'b', which is one of your input matrices. You probably want to print out 'c'. – 2013-04-22 22:59:52
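Printing the wrong buffer is easy to miss by eye, since an input matrix looks just as plausible as an output. One way to guard against it, sketched here under the assumption that N is small enough for a CPU triple loop to be cheap, is to print through a helper and compare the copied-back result against a host-side recomputation. Both `printMatrix` and `countMismatches` are hypothetical names, not part of the original code.

```cuda
#include <stdio.h>

// Hypothetical helper: prints an N x N row-major matrix, tab-separated.
void printMatrix(const int *m, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%d\t", m[i * N + j]);
        }
        printf("\n");
    }
}

// Hypothetical helper: recomputes a * b on the CPU and returns how many
// entries of c differ from that reference result.
int countMismatches(const int *a, const int *b, const int *c, int N) {
    int mismatches = 0;
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            int val = 0;
            for (int e = 0; e < N; ++e) {
                val += a[row * N + e] * b[e * N + col];
            }
            if (c[row * N + col] != val) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

After the device-to-host `cudaMemcpy`, calling `printMatrix(c, N);` prints the result matrix, and `countMismatches(a, b, c, N)` should return 0 if the kernel filled in every entry.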
Thank you. I don't know how I missed that. Everything seems to be working now. – Chris 2013-04-23 04:23:12