My kernel only works in block (0,0)

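I am trying to write a simple matrix multiplication application that multiplies two square matrices using CUDA. My problem is that the kernel only computes correct results in block (0,0) of the grid.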

Here is my calling code:

dim3 dimBlock(4,4,1); 
dim3 dimGrid(4,4,1); 
//Launch the kernel; 
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

Here is my kernel function:

__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int bx = blockIdx.x;
    const int by = blockIdx.y;
    const int row = (by * blockDim.y + ty);
    const int col = (bx * blockDim.x + tx);

    //Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;

    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    __syncthreads();
    //Write the matrix to device memory each thread writes one element
    Pd[row * Width + col] = Pvalue;
}

I think the problem may be memory-related, but I am a bit lost. What should I do to make this code work across multiple blocks?

Source: ZeroDivide, 2010-06-09

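After going through the output and doing the multiplication by hand, I found that not all of the values are correct. I think I may have a problem with how the indices are derived. – ZeroDivide 2010-06-09 19:02:54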

The problem was in my CUDA kernel call. The grid was too small for the matrix being processed.
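For reference, the launch in the question uses a 4 x 4 grid of 4 x 4 blocks, which covers only 16 x 16 threads. One common way to avoid a mismatch is to derive the grid size from Width and to guard the kernel against threads that fall outside the matrix. The following is a minimal sketch under those assumptions, not the code that was actually used here; the names ceilDiv, MatrixMulKernelGuarded and launchMatrixMul are hypothetical and do not appear in the original post.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: integer ceiling division, used to round the grid up
// so that gridDim * blockDim >= Width in each dimension.
inline unsigned int ceilDiv(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

// Same computation as the kernel in the question, plus a bounds guard.
__global__ void MatrixMulKernelGuarded(const int* Md, const int* Nd, int* Pd, int Width)
{
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads in the rounded-up part of the grid have no element to compute.
    if (row >= Width || col >= Width) return;

    int Pvalue = 0;
    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }
    Pd[row * Width + col] = Pvalue;
}

// Hypothetical host wrapper. Md, Nd and Pd are assumed to be device pointers
// to Width * Width ints, as in the question; Width is assumed to be positive.
inline cudaError_t launchMatrixMul(const int* Md, const int* Nd, int* Pd, int Width)
{
    const dim3 dimBlock(4, 4, 1);
    const dim3 dimGrid(ceilDiv(Width, dimBlock.x), ceilDiv(Width, dimBlock.y), 1);
    MatrixMulKernelGuarded<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    return cudaGetLastError();  // reports launch-configuration errors
}
```

With this sizing, a Width of 16 still gives a 4 x 4 grid, while a Width of 17 gives a 5 x 5 grid whose extra threads exit at the guard. The guard also matters when the grid is larger than the matrix, since unguarded threads would otherwise read and write outside the Width x Width range.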

Source: ZeroDivide, 2010-06-09 22:45:44
