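My kernel only works in block (0,0)
I am trying to write a simple matrix multiplication application that multiplies two square matrices using CUDA. My problem is that the kernel only computes correct results in block (0,0) of the grid.
Here is my calling code: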
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
Here is my kernel function:
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);
    //Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();
    //Write the matrix to device memory each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
I think the problem may be memory-related, but I am a bit lost. What should I do to make this code work across multiple blocks?
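One thing that could be tried, sketched here under the assumption that Md, Nd, and Pd are device pointers to Width * Width ints each: derive the grid size from Width by ceiling division and have every thread check row and col before touching memory. The fixed 4x4 grid of 4x4 blocks above covers exactly 16 x 16 elements, so with any other Width some elements are either never computed or accessed out of bounds. The names MatrixMulKernelGuarded, launchMatrixMul, and TILE are hypothetical and not part of the original code.

```cuda
#include <cuda_runtime.h>

// Hypothetical block edge length; 4 matches dimBlock(4,4,1) in the question.
constexpr int TILE = 4;

// Same per-thread dot product as the original kernel, plus a bounds guard so
// threads that fall outside the Width x Width matrix do nothing.
__global__ void MatrixMulKernelGuarded(const int* Md, const int* Nd, int* Pd, int Width)
{
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= Width || col >= Width) return;

    int Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];

    // No __syncthreads() here: the threads do not share any data.
    Pd[row * Width + col] = Pvalue;
}

// Hypothetical host helper: sizes the grid from Width and reports errors.
// Assumes Md, Nd, and Pd are device pointers to Width * Width ints each.
cudaError_t launchMatrixMul(const int* Md, const int* Nd, int* Pd, int Width)
{
    dim3 dimBlock(TILE, TILE, 1);
    const int blocks = (Width + TILE - 1) / TILE;  // ceiling division
    dim3 dimGrid(blocks, blocks, 1);

    MatrixMulKernelGuarded<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    cudaError_t err = cudaGetLastError();          // errors raised by the launch itself
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();                // errors raised while the kernel ran
}
```

Checking the value returned by launchMatrixMul shows whether the kernel failed outright. It may also be worth confirming that the cudaMalloc and cudaMemcpy calls on the host use Width * Width * sizeof(int) bytes, since a size given in elements instead of bytes is one common way to end up with only part of the output being correct.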
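After going through it and doing the multiplication by hand, I found that not all of the values are correct. I think I may have a problem with how the indices are derived. – ZeroDivide 2010-06-09 19:02:54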
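Instead of checking by hand, a host-side reference can show exactly which elements, and therefore which blocks, are wrong. This is a minimal sketch that assumes the result has already been copied back into a host array; matMulReference and firstMismatch are hypothetical helpers, and the blockEdge parameter is meant to mirror the block size used at launch (4 for dimBlock(4,4,1)).

```cuda
#include <cstdio>
#include <vector>

// Hypothetical CPU reference: P = M * N for row-major Width x Width int matrices.
void matMulReference(const int* M, const int* N, int* P, int Width)
{
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col) {
            int sum = 0;
            for (int k = 0; k < Width; ++k)
                sum += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = sum;
        }
}

// Returns the linear index of the first element where got differs from the CPU
// reference, or -1 if every element matches. For a mismatch it also prints the
// (blockIdx.x, blockIdx.y) pair that owns the element, given the block edge.
int firstMismatch(const int* M, const int* N, const int* got, int Width, int blockEdge)
{
    std::vector<int> expected(Width * Width);
    matMulReference(M, N, expected.data(), Width);
    for (int i = 0; i < Width * Width; ++i) {
        if (got[i] != expected[i]) {
            const int row = i / Width;
            const int col = i % Width;
            std::printf("mismatch at (row %d, col %d), block (%d,%d): got %d, expected %d\n",
                        row, col, col / blockEdge, row / blockEdge, got[i], expected[i]);
            return i;
        }
    }
    return -1;
}
```

If the first mismatch always falls just outside block (0,0), that points at the launch configuration or the buffer sizes rather than at the row and col arithmetic inside the kernel.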