I implemented the following matrix multiplication code using CUDA 3.2 and Visual Studio 2008. I am running on Windows Server 2008 R2 Enterprise with an Nvidia GTX 480. The code below works fine for "Width" (the matrix width) values up to around 2500, but for larger matrices the CUDA matrix multiplication breaks.
int size = Width*Width*sizeof(float);
float* Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;
//Allocate Device Memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);
//Copy Matrix from Host Memory to Device Memory
err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
//Setup the execution configuration
dim3 dimBlock(TileWidth, TileWidth, 1);
dim3 dimGrid(ceil((float)(Width)/TileWidth), ceil((float)(Width)/TileWidth), 1);
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
//Free Device Memory
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
When I set "Width" to 3000 or higher, I get a black screen followed by this error:
I looked around online and saw that some people have this problem because the watchdog kills the kernel after it hangs for more than 5 seconds. I tried editing "TdrDelay" in the registry, and that only delayed the time before the black screen and the same error appeared. So I don't think this is my problem.
I debugged my code and found this line to be the culprit:
err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
This is the line I use to copy my result set back from the device after my matrix multiplication kernel function is called. Everything up to this point seems to run fine. I believe I am allocating memory correctly, and I can't figure out why this is happening. I thought maybe I didn't have enough memory on my card, but then shouldn't cudaMalloc have returned an error? (I confirmed that it didn't while debugging.)
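For reference, kernel launches are asynchronous, so an error raised by or inside the kernel usually only surfaces at the next synchronizing call, which here is that cudaMemcpy. Below is a minimal, self-contained sketch of how launch and execution errors can be checked separately, using cudaThreadSynchronize from the CUDA 3.2-era runtime; the Dummy kernel is just a stand-in I made up for illustration:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Dummy() {} //stand-in for the real kernel

int main()
{
    Dummy<<<1, 1>>>();
    //A launch error (e.g. a bad execution configuration) is available immediately
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));
    //An error during kernel execution only appears after synchronizing
    err = cudaThreadSynchronize(); //CUDA 3.2-era synchronization call
    if (err != cudaSuccess) printf("execution: %s\n", cudaGetErrorString(err));
    return 0;
}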
Any ideas/assistance would be greatly appreciated!... Thanks a lot, guys!
Kernel code:
//Matrix Multiplication Kernel - Multi-Block Implementation
__global__ void MatrixMultiplicationMultiBlock_Kernel (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;

    //Get row and column from block and thread ids
    int Row = (TileWidth*blockIdx.y) + threadIdx.y;
    int Column = (TileWidth*blockIdx.x) + threadIdx.x;

    //Pvalue stores the Pd element that is computed by the thread
    float Pvalue = 0;

    for (int i = 0; i < Width; ++i)
    {
        float Mdelement = Md[Row * Width + i];
        float Ndelement = Nd[i * Width + Column];
        Pvalue += Mdelement * Ndelement;
    }

    //Write the matrix to device memory; each thread writes one element
    Pd[Row * Width + Column] = Pvalue;
}
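One side note on the execution configuration: because dimGrid is rounded up with ceil, whenever Width is not a multiple of TileWidth the threads past the last row and column read and write out of bounds. Here is a bounds-guarded sketch of the same kernel (the _Guarded name is mine, and this is offered only as an illustration, not as a confirmed cause of the error):
//Sketch: the same kernel with a bounds guard for the extra threads that the
//rounded-up grid creates past the matrix edge
__global__ void MatrixMultiplicationMultiBlock_Guarded (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;
    int Row = (TileWidth*blockIdx.y) + threadIdx.y;
    int Column = (TileWidth*blockIdx.x) + threadIdx.x;

    //Threads outside the matrix do nothing
    if (Row >= Width || Column >= Width)
        return;

    float Pvalue = 0;
    for (int i = 0; i < Width; ++i)
        Pvalue += Md[Row * Width + i] * Nd[i * Width + Column];

    Pd[Row * Width + Column] = Pvalue;
}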
I also have this other function that uses shared memory, and it gives the same error:
The call:
MatrixMultiplicationSharedMemory_Kernel<<<dimGrid, dimBlock, sizeof(float)*TileWidth*TileWidth*2>>>(Md, Nd, Pd, Width);
Kernel code:
//Matrix Multiplication Kernel - Shared Memory Implementation
__global__ void MatrixMultiplicationSharedMemory_Kernel (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;

    //Set up shared memory: Mds is the first TileWidth*TileWidth floats, Nds the second
    extern __shared__ float sharedArrays[];
    float* Mds = (float*) &sharedArrays;
    float* Nds = (float*) &Mds[TileWidth*TileWidth];

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Get row and column from block and thread ids
    int Row = (TileWidth*blockIdx.y) + ty;
    int Column = (TileWidth*blockIdx.x) + tx;
    float Pvalue = 0;

    //For each tile, load the elements into shared memory
    for(int i = 0; i < ceil((float)Width/TileWidth); ++i)
    {
        Mds[ty*TileWidth+tx] = Md[Row*Width + (i*TileWidth + tx)];
        Nds[ty*TileWidth+tx] = Nd[(ty + (i * TileWidth))*Width + Column];
        __syncthreads();

        for(int j = 0; j < TileWidth; ++j)
        {
            Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
        }
        __syncthreads();
    }

    //Write the matrix to device memory; each thread writes one element
    Pd[Row * Width + Column] = Pvalue;
}
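The same edge issue would apply here, and in the shared-memory version the tile loads can also read out of bounds on the last tiles, while no thread may skip the __syncthreads() barriers. A guarded sketch of the same kernel (again, the _Guarded name is mine and this is only an illustration, assuming Width is not a multiple of TileWidth):
//Sketch: the shared-memory kernel with guards on the tile loads and the
//final store; out-of-range elements are loaded as 0 so they add nothing
__global__ void MatrixMultiplicationSharedMemory_Guarded (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;
    extern __shared__ float sharedArrays[];
    float* Mds = sharedArrays;
    float* Nds = &Mds[TileWidth*TileWidth];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int Row = (TileWidth*blockIdx.y) + ty;
    int Column = (TileWidth*blockIdx.x) + tx;
    float Pvalue = 0;

    int numTiles = (Width + TileWidth - 1) / TileWidth; //integer ceiling
    for(int i = 0; i < numTiles; ++i)
    {
        int mCol = i*TileWidth + tx; //column read from Md
        int nRow = i*TileWidth + ty; //row read from Nd
        Mds[ty*TileWidth+tx] = (Row < Width && mCol < Width) ? Md[Row*Width + mCol] : 0.0f;
        Nds[ty*TileWidth+tx] = (nRow < Width && Column < Width) ? Nd[nRow*Width + Column] : 0.0f;
        __syncthreads(); //every thread must reach this, hence no early return above

        for(int j = 0; j < TileWidth; ++j)
            Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
        __syncthreads();
    }

    if (Row < Width && Column < Width)
        Pd[Row * Width + Column] = Pvalue;
}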
Could you please post the kernel code? – Tom 2010-11-03 15:59:39
EDIT: Added kernel code – ntsue 2010-11-04 03:52:19
EDIT: Added both kernel functions – ntsue 2010-11-04 12:22:37