CUDA problem with atomicAdd in a nested loop (grids larger than 760 by 760)

I have an unknown error in my CUDA program that seems to be related to the atomicAdd function. I am coding on Windows in Visual Studio 2015. My calling code is as follows

int regionWidth=32; 
int regionHeight=32; 
dim3 gridSize(765,765); 
dim3 blockSize(regionWidth, regionHeight); 

// host-side counter and its device copy (declarations assumed; not shown in the original snippet)
int count = 0; 
int *dev_count; 

cudaMalloc((void **)&dev_count, sizeof(int)); 
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice); 

crashFN<<<gridSize, blockSize>>>(regionWidth, regionHeight, dev_count); 

// blocking copy: waits for the kernel to finish before reading the counter
cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost); 

printf("total number of threads that executed was: %d vs. %d called -> %s\n", count, gridSize.x*gridSize.y*blockSize.x*blockSize.y, (count==gridSize.x*gridSize.y*blockSize.x*blockSize.y)?"ok":"error");

Then my global kernel function is

__global__ 
void crashFN(int regionWidth, int regionHeight, int* ct) 
{ 
    __shared__ int shared_sum; 

    shared_sum = 0; 

    sumGlobal(regionWidth, regionHeight, &shared_sum); 

    atomicAdd(ct, 1); 
}

with sumGlobal defined as

__device__ 
void sumGlobal(int regionWidth, int regionHeight, int* global_sum) 
{ 
    // sum in nested loop 
    for (int y = 0; y < regionHeight; y++) 
     for (int x = 0; x < regionWidth; x++) 
       atomicAdd(global_sum, 1); 
}

The build output for the program is as follows

1> H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU 
Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" - 
gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl- 
version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 
14.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing 
Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing 
Toolkit\CUDA\v8.0\include"  --keep-dir x64\Release -maxrregcount=0 -- 
machine 64 --compile -cudart static  -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE 
-D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD " -o 
x64\Release\targetDetectionGPU.cu.obj 
"H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"

This is a standard NVIDIA CUDA console project; the only change is the arch setting, which is now sm_50,compute_50.

The output of my program is as follows (including debug information)

sharedMemBytes=36864 
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32 
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32 
There is 1 device supporting CUDA 

Device 0: "GeForce GTX 1050 Ti" 
    CUDA Driver Version:       9.0 
    CUDA Runtime Version:       8.0 
    CUDA Capability Major revision number:   6 
    CUDA Capability Minor revision number:   1 
    Total amount of global memory:     0 bytes 
    Number of multiprocessors:      6 
    Number of cores:        288 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per block:   1024 
    Maximum sizes of each dimension of a block: 1024 x 1024 x 64 
    Maximum sizes of each dimension of a grid:  2147483647 x 65535 x 65535 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Clock rate:         1.39 GHz 
    Concurrent copy and execution:     Yes 
    Run time limit on kernels:      Yes 
    Integrated:         No 
    Support host page-locked memory mapping:  Yes 
    Compute mode:         Default (multiple host    
    threads can use this device simultaneously) 
    Concurrent kernel execution:     Yes 
    Device has ECC support enabled:    No 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime 
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti 
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB 
total number of threads that executed was: 0 vs. 599270400 called -> error 
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API 
error (30): unknown error 
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API 
error (30): unknown error 
finshed cuda algorithm

With a smaller grid size it seems to work better.

So when I instead choose a grid size of 764 by 764, I get

Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB 
total number of threads that executed was: 597704704 vs. 597704704 called -> 
ok 
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API 
error (30): unknown error

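With 750 x 750 the error disappeared; with 760 x 760 the error came back.

The device specification allows grid sizes much larger than 765, or am I missing something? I don't see why a simple atomicAdd in a nested loop should cause these errors. Is this a bug?

OK, I have now simplified the kernel: I removed the function call and combined the two loops into one. The error still occurs for the larger grid sizes, but if I comment out the loop it runs fine.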

__global__ 
void crashFN(int regionWidth, int regionHeight, int* ct) 
{ 
    __shared__ int shared_sum; 

    shared_sum = 0; 
    __syncthreads(); 

    for (int y = 0; y < regionHeight*regionWidth; y++) 
      atomicAdd(&shared_sum, 1); 

    __syncthreads(); 

    atomicAdd(ct, 1); 
}

If I shorten the loop to

for (int y = 0; y < regionHeight; y++) 
      atomicAdd(&shared_sum, 1);

then it works fine. It seems to be a timeout problem, which is strange because I set the WDDM TDR timeout to 10 seconds with the Nsight Monitor.
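One quick check related to the timeout suspicion is whether the display watchdog applies to the device at all. The following is a minimal sketch, assuming device index 0; the helper name `watchdogEnabled` is hypothetical and not part of the original program. It only reports whether a run time limit is active, not how long the limit is.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post).
// Returns 1 if a run time limit (watchdog) applies to kernels on this device,
// 0 if it does not, and -1 if the device properties could not be queried.
int watchdogEnabled(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}

// Prints the result in the same spirit as the deviceQuery output above.
void printWatchdogStatus(int device)
{
    int status = watchdogEnabled(device);
    if (status < 0)
        printf("could not query device %d\n", device);
    else
        printf("Run time limit on kernels: %s\n", status ? "Yes" : "No");
}
```

Calling `printWatchdogStatus(0)` from `main` should agree with the "Run time limit on kernels: Yes" line in the device report above, which is what one would expect for a GPU that also drives a display.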

Source

2017-09-20 ejectamenta

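1. Use [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). 2. Run your code with `cuda-memcheck`. 3. Provide a [mcve]. SO expects this for questions like this one, see [here](https://stackoverflow.com/help/on-topic) (<- click there and read item 1). This certainly does look like a timeout problem; perhaps 10 seconds is not enough. You need to reboot after changing the timeout. –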
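The error checking suggested in the comment above could look roughly like the sketch below. The macro name `CUDA_CHECK`, the kernel `countThreads`, and the wrapper `launchChecked` are hypothetical names introduced for illustration; they are not from the original program. The pattern is to check every runtime call, then check the launch itself with `cudaGetLastError()`, then synchronize so that errors raised while the kernel runs are reported at a known line.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): prints the error
// with file and line, then exits, when a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",                  \
                    (int)err_, cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

// Stand-in kernel: one global atomic per thread, as in the question.
__global__ void countThreads(int* ct)
{
    atomicAdd(ct, 1);
}

// Launches the kernel with full error checking and returns the thread count.
int launchChecked(dim3 gridSize, dim3 blockSize)
{
    int count = 0;
    int* dev_count = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&dev_count, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice));

    countThreads<<<gridSize, blockSize>>>(dev_count);
    CUDA_CHECK(cudaGetLastError());       // invalid launch configuration etc.
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel runs

    CUDA_CHECK(cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev_count));
    return count;
}
```

With this in place, a kernel that is killed while running would normally be reported at the `cudaDeviceSynchronize()` line, rather than at whichever later call happens to notice the failure, as with lines 558 and 573 in the log above.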

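I debugged with cuda-memcheck, and the debugger stops in the atomicAdd function. I think this example is quite minimal and verifiable (what do you think is not minimal?). Yes, I rebooted, and the program does not hang for anything close to 10 seconds before crashing; it is more like 1 second. – ejectamenta

After assembling the pieces you provided and adding some missing lines of code, I was able to build something from what you have shown. If I build a debug project, the kernel takes more than 10 seconds to execute on a Pascal Titan X (i.e. a fast GPU). If you get an error after 1 second of running a debug project, that suggests you are hitting a timeout. When I run your code, however, I get no errors. So I have reason to believe that you are hitting a timeout and that your TDR setting was not modified correctly. –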

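If you get an "error (30): unknown error", suspect a TDR timeout, especially on Windows. Basically, my test program spent so long in the loop that it triggered the timeout. This is especially likely when you debug with printf statements!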
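To get a feel for why the loop takes so long, it helps to count the atomic operations the launch requests. The following host-only sketch does that arithmetic; the function names are hypothetical, and it assumes the simplified crashFN kernel from the question, where each thread performs `loopIters` shared-memory atomics plus one global atomic.

```cuda
#include <cstdio>

// Total threads launched for a 2D grid of 2D blocks.
unsigned long long totalThreads(unsigned gx, unsigned gy, unsigned bx, unsigned by)
{
    return (unsigned long long)gx * gy * bx * by;
}

// Total atomic operations: loopIters shared-memory atomics per thread,
// plus one global atomicAdd per thread.
unsigned long long totalAtomics(unsigned long long threads, unsigned loopIters)
{
    return threads * (loopIters + 1ULL);
}

// Prints the workload for one launch configuration.
void printWorkload(unsigned gx, unsigned gy, unsigned bx, unsigned by, unsigned loopIters)
{
    unsigned long long threads = totalThreads(gx, gy, bx, by);
    printf("threads=%llu atomics=%llu\n", threads, totalAtomics(threads, loopIters));
}
```

For the 765 x 765 grid with 32 x 32 blocks this gives 599270400 threads, matching the log. With the full loop of 32*32 = 1024 iterations that is 614252160000 atomic operations; with the shortened loop of 32 iterations it is 19775923200, roughly 31 times fewer, which is consistent with the shortened loop finishing before the watchdog fires.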

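The solution is to increase the timeout by changing the TDR setting to something more like 30 seconds. Increasing this value is not a problem when the GPU card is not used for the main display. With a larger TDR value you can see more clearly that it is your program taking too long rather than something else. Try to improve the code by removing loops, especially those containing atomic operations, or restructure it to use a technique such as reduction.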

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
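As an illustration of the reduction idea, here is a minimal sketch that replaces the per-thread atomic loop with a shared-memory tree reduction and a single global atomic per block. It is not the original program: the names `countWithReduction` and `countThreadsReduced` are hypothetical, each thread simply contributes 1, the block is assumed to have a power-of-two number of threads no larger than 1024, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Assumption: blockDim.x * blockDim.y is a power of two and at most 1024.
__global__ void countWithReduction(unsigned long long* total)
{
    __shared__ unsigned int partial[1024];

    unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int nThreads = blockDim.x * blockDim.y;

    partial[tid] = 1;  // this thread's contribution, written without atomics
    __syncthreads();

    // Tree reduction in shared memory: halve the active range at each step.
    for (unsigned int stride = nThreads / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // A single global atomic per block instead of one per thread.
    if (tid == 0)
        atomicAdd(total, (unsigned long long)partial[0]);
}

// Host wrapper (error checking omitted for brevity): returns the summed count.
unsigned long long countThreadsReduced(dim3 gridSize, dim3 blockSize)
{
    unsigned long long count = 0;
    unsigned long long* dev_count = nullptr;
    cudaMalloc((void**)&dev_count, sizeof(unsigned long long));
    cudaMemcpy(dev_count, &count, sizeof(count), cudaMemcpyHostToDevice);

    countWithReduction<<<gridSize, blockSize>>>(dev_count);
    cudaDeviceSynchronize();

    cudaMemcpy(&count, dev_count, sizeof(count), cudaMemcpyDeviceToHost);
    cudaFree(dev_count);
    return count;
}
```

Each block now needs about ten reduction steps for 1024 threads and issues one global atomic, so contention on the shared and global counters is far lower than with an atomic in every loop iteration. Whether this is fast enough to stay under the TDR limit on a given GPU would still need to be measured.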

Source

2017-09-21 14:40:55 ejectamenta
