CUDA problem: atomicAdd in nested loop crashes for grids larger than about 760 by 760

I have an unknown error in my CUDA program that seems to be related to the atomicAdd function. I am coding in Visual Studio 2015 on Windows. My calling function is the following:
int regionWidth=32;
int regionHeight=32;
dim3 gridSize(765,765);
dim3 blockSize(regionWidth, regionHeight);
cudaMalloc((void **)&dev_count, sizeof(int));
count = 0;
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice);
crashFN<<<gridSize, blockSize>>>(regionWidth, regionHeight, dev_count);
cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);
printf("total number of threads that executed was: %d vs. %d called -> %s\n", count, gridSize.x*gridSize.y*blockSize.x*blockSize.y, (count==gridSize.x*gridSize.y*blockSize.x*blockSize.y)?"ok":"error");
Then my global kernel function is:
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
__shared__ int shared_sum;
shared_sum = 0;
sumGlobal(regionWidth, regionHeight, &shared_sum);
atomicAdd(ct, 1);
}
with sumGlobal defined as:
__device__
void sumGlobal(int regionWidth, int regionHeight, int* global_sum)
{
// sum in nested loop
for (int y = 0; y < regionHeight; y++)
for (int x = 0; x < regionWidth; x++)
atomicAdd(global_sum, 1);
}
The build output of the program is the following:
1> H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU
Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -
gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-
version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio
14.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v8.0\include" --keep-dir x64\Release -maxrregcount=0 --
machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE
-D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o
x64\Release\targetDetectionGPU.cu.obj
"H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"
This is a standard NVIDIA CUDA console project where only the arch flags were changed to sm_50,compute_50.
The output of my program is as follows (debug information):
sharedMemBytes=36864
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32
There is 1 device supporting CUDA
Device 0: "GeForce GTX 1050 Ti"
CUDA Driver Version: 9.0
CUDA Runtime Version: 8.0
CUDA Capability Major revision number: 6
CUDA Capability Minor revision number: 1
Total amount of global memory: 0 bytes
Number of multiprocessors: 6
Number of cores: 288
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.39 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host
threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 0 vs. 599270400 called -> error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API
error (30): unknown error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API
error (30): unknown error
finshed cuda algorithm
With smaller grid sizes it seems to work better, so when I instead choose a grid size of 764 by 764 I get:
Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 597704704 vs. 597704704 called ->
ok
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API
error (30): unknown error
With 750 x 750 the error is gone; with 760 x 760 the error is back.
The device specification allows grid sizes far larger than 765, so am I missing something? I don't see why a simple atomicAdd in a nested loop should cause these errors. Is this a bug?
OK, I have now simplified the kernel, removed the function call and collapsed the nested loop into a single loop for the larger grid sizes, but the error remains; if I comment out the loop it runs fine.
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
__shared__ int shared_sum;
shared_sum = 0;
__syncthreads();
for (int y = 0; y < regionHeight*regionWidth; y++)
atomicAdd(&shared_sum, 1);
__syncthreads();
atomicAdd(ct, 1);
}
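Since all 1024 threads of a block hammer the same shared-memory address, the loop above serializes into about a million atomic operations per block. A hypothetical rewrite (a sketch, not tested against the original project) that leaves shared_sum with the same final value using a single atomic per thread would finish far faster:

```cuda
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        shared_sum = 0;           // initialize once, not from every thread
    __syncthreads();

    // one atomic per thread instead of regionWidth*regionHeight of them;
    // the final value of shared_sum is unchanged
    atomicAdd(&shared_sum, regionWidth * regionHeight);
    __syncthreads();

    atomicAdd(ct, 1);
}
```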
If I shorten the loop to
for (int y = 0; y < regionHeight; y++)
atomicAdd(&shared_sum, 1);
then it works fine. It seems to be a timeout problem, which is strange because I set the WDDM TDR timeout to 10 seconds with the Nsight Monitor.
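For reference, the WDDM timeout is controlled by the documented `TdrDelay` (and `TdrDdiDelay`) registry values, which is what Nsight Monitor writes on your behalf; a reboot is required before the change takes effect. A sketch of a .reg file raising the delay to 30 seconds (the DWORD values are in seconds):

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000001e
"TdrDdiDelay"=dword:0000001e
```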
1. Use [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). 2. Run your code with `cuda-memcheck`. 3. Provide an [mcve]. As laid out in the SO expectations [here](https://stackoverflow.com/help/on-topic) (<- click there and read item 1), this is required for questions like this one. This certainly does look like a timeout problem; perhaps 10 seconds is not enough. You need to reboot after making the timeout change. –
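The canonical error checking from the first link amounts to wrapping every runtime call and checking again after each kernel launch; a minimal sketch (the macro name is illustrative, not from the question's code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                         cudaGetErrorString(err), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// usage around the launch shown in the question:
// CUDA_CHECK(cudaMalloc((void**)&dev_count, sizeof(int)));
// crashFN<<<gridSize, blockSize>>>(regionWidth, regionHeight, dev_count);
// CUDA_CHECK(cudaGetLastError());       // catches launch errors
// CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors (e.g. a TDR)
```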
I debugged with cuda-memcheck; the debugger stops in the atomicAdd function. I think the example is quite minimal and verifiable (what do you think is not minimal about it?). And yes, I rebooted; the program does not hang for anywhere near 10 seconds before crashing, more like 1 second. – ejectamenta
After assembling the pieces you have provided and adding a few missing lines of code, I was able to build something out of what you have shown. If I build a debug project, the kernel takes more than 10 seconds to execute on a Pascal Titan X (i.e. a fast GPU). If you get an error after 1 second of execution of a debug project, you are hitting the timeout. However, when I run your code I get no errors at all. So I have good reason to believe that you are hitting the timeout and that your TDR has not been modified correctly. –