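Unknown error when launching a kernel with a large size
I'm having trouble launching a simple kernel when my array is larger than 591x591. At 591x591 the array comes back without any error, but as soon as I launch the kernel with a grid of 38x38 blocks of 16x16 threads each, the kernel fails to launch and returns "unknown error".
Below are my kernel and the code that calls it: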
#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_device_runtime_api.h>
using namespace std;
#define BLOCKSIZE 16
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__,__LINE__);}
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if(abort) exit(code);
}
}
__global__ void IdentityMatrixKernel(float* identity, int size)
{
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
// map the two 2D indices to a single linear, 1D index
int grid_width = gridDim.x * blockDim.x;
int index = index_y * grid_width + index_x;
// map the two 2D block indices to a single linear, 1D block index
//int result = blockIdx.y * gridDim.x + blockIdx.x;
if (index % (size+1))
{
identity[index] = 0;
}
else
{
identity[index] = 1;
}
}
void foo(float *aArray, int size)
{
float* d_I;
int size2 = size*size*sizeof(float);
gpuErrchk(cudaMalloc(&d_I,size2));
dim3 block_size;
block_size.x = BLOCKSIZE;
block_size.y = BLOCKSIZE;
dim3 grid_size;
grid_size.x = size / block_size.x + 1;
grid_size.y = size / block_size.y + 1;
IdentityMatrixKernel<<<grid_size,block_size>>>(d_I,size);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaMemcpy(aArray,d_I,size2,cudaMemcpyDeviceToHost));
cudaFree(d_I);
}
int main()
{
int size = 591;
float *aArray = (float*)malloc(size*size*sizeof(float));
foo(aArray,size);
return 0;
}
For size = 591 no error is reported and the output is a 591x591 identity matrix, but for any larger size it prints "unknown error" to the console.
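I don't think this is the code you are actually running; it has several compile problems. Please check that the code you post really compiles, fix whatever doesn't, and then make sure it actually reproduces the problem. After that, run it under 'cuda-memcheck'. I think you'll find your kernel is generating a lot of errors (e.g. out-of-bounds accesses: "Invalid __global__ write of size 4", and so on).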
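Following up on the comment above: the posted launch uses size/16 + 1 blocks per dimension, and every launched thread writes to the array. For size = 591 that is a 37x37 grid, so 592x592 = 350,464 threads write into 349,281 elements. For size = 592 it is a 38x38 grid, so 608x608 = 369,664 threads write into 350,464 elements. The small overrun at 591 most likely goes unnoticed only because of how the allocation happens to be padded, which is an assumption, not something the post confirms. Below is a minimal sketch of a guarded version. The name `IdentityMatrixKernelGuarded`, the row/column indexing, and the host-side check are illustrative choices, not the original poster's code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCKSIZE 16

// Hypothetical guarded variant of the posted kernel: each thread owns one
// (row, col) element, and threads outside the size x size matrix do nothing.
__global__ void IdentityMatrixKernelGuarded(float* identity, int size)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Bounds check that the posted kernel lacks.
    if (row >= size || col >= size) return;

    // Row stride is the matrix width, not the grid width.
    identity[row * size + col] = (row == col) ? 1.0f : 0.0f;
}

int main()
{
    const int size = 592;  // a size that failed in the question
    const size_t bytes = (size_t)size * size * sizeof(float);

    float* h_I = (float*)malloc(bytes);
    float* d_I = nullptr;
    if (h_I == nullptr || cudaMalloc(&d_I, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    dim3 block(BLOCKSIZE, BLOCKSIZE);
    // Round up so the grid covers the matrix; the guard handles the overhang.
    dim3 grid((size + block.x - 1) / block.x, (size + block.y - 1) / block.y);

    IdentityMatrixKernelGuarded<<<grid, block>>>(d_I, size);

    cudaError_t err = cudaGetLastError();                   // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(h_I, d_I, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Host-side check: count elements that differ from the identity matrix.
    int mismatches = 0;
    for (int r = 0; r < size; ++r)
        for (int c = 0; c < size; ++c)
            if (h_I[r * size + c] != ((r == c) ? 1.0f : 0.0f)) ++mismatches;
    printf("mismatched elements: %d\n", mismatches);

    cudaFree(d_I);
    free(h_I);
    return mismatches != 0;
}
```

Running the original and the guarded build under cuda-memcheck (or compute-sanitizer on newer toolkits) should show whether the invalid global writes go away. If they still appear with the guard in place, the cause is somewhere other than the missing bounds check.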