C++: Simple CUDA volume reconstruction code crashes


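I am currently working on a larger project that involves CUDA. For the last few days I have kept running into an error and have been trying desperately to fix it. I could not figure it out, so I have put together a minimal example that shows the same behaviour. I should say that I am quite new to CUDA. I am using Visual Studio 2015 and CUDA Toolkit 7.5.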

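The program creates a 3D volume in GPU memory, then computes values and writes them into the volume. I tried to keep the code as simple as possible: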

First, the main.cpp file:

#include "cuda_test.h" 

int main() { 

    size_t const xDimension = 500; 
    size_t const yDimension = 500; 
    size_t const zDimension = 1000; 

    //allocate volume part memory on gpu 
    cudaPitchedPtr volume = ct::cuda::create3dVolumeOnGPU(xDimension, yDimension, zDimension); 

    //start reconstruction 
    ct::cuda::startReconstruction(volume, 
            xDimension, 
            yDimension, 
            zDimension); 

    return 0;

}

Then cuda_test.h, which is the header for the actual .cu file:

#ifndef CT_CUDA 
#define CT_CUDA 

#include <cstdlib> 
#include <stdio.h> 
#include <cmath> 

//CUDA 
#include <cuda_runtime.h> 

namespace ct { 

    namespace cuda { 

     cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize); 
     void startReconstruction(cudaPitchedPtr volume, 
           size_t xSize, 
           size_t ySize, 
           size_t zSize); 

    } 

} 

#endif


Then the cuda_test.cu file, which contains the actual function implementations:

#include "cuda_test.h" 

namespace ct { 

    namespace cuda { 

     cudaPitchedPtr create3dVolumeOnGPU(size_t xSize, size_t ySize, size_t zSize) { 
      cudaExtent extent = make_cudaExtent(xSize * sizeof(float), ySize, zSize); 
      cudaPitchedPtr ptr; 
      cudaMalloc3D(&ptr, extent); 
      printf("malloc3D: %s\n", cudaGetErrorString(cudaGetLastError())); 
      cudaMemset3D(ptr, 0, extent); 
      printf("memset: %s\n", cudaGetErrorString(cudaGetLastError())); 
      return ptr; 
     } 

     __device__ void addToVolumeElement(cudaPitchedPtr volumePtr, size_t ySize, size_t xCoord, size_t yCoord, size_t zCoord, float value) { 
      char* devicePtr = (char*)(volumePtr.ptr); 
      //z * xSize * ySize + y * xSize + x 
      size_t pitch = volumePtr.pitch; 
      size_t slicePitch = pitch * ySize; 
      char* slice = devicePtr + zCoord*slicePitch; 
      float* row = (float*)(slice + yCoord * pitch); 
      row[xCoord] += value; 
     } 

     __global__ void reconstructionKernel(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) { 

      size_t xIndex = blockIdx.x; 
      size_t yIndex = blockIdx.y; 
      size_t zIndex = blockIdx.z; 

      if (xIndex == 0 && yIndex == 0 && zIndex == 0) { 
       printf("kernel start\n"); 
      } 

      //just make sure we're inside the volume bounds 
      if (xIndex < xSize && yIndex < ySize && zIndex < zSize) { 

       //float value = z; 
       float value = sqrt(sqrt(sqrt(5.3))) * sqrt(sqrt(sqrt(1.2))) * sqrt(sqrt(sqrt(10.8))) + 501 * 0.125 * 0.786/5.3; 

       addToVolumeElement(volumePtr, ySize, xIndex, yIndex, zIndex, value); 

      } 

      if (xIndex == 0 && yIndex == 0 && zIndex == 0) { 
       printf("kernel end\n"); 
      } 

     } 

     void startReconstruction(cudaPitchedPtr volumePtr, size_t xSize, size_t ySize, size_t zSize) { 
      dim3 blocks(xSize, ySize, zSize); 
      reconstructionKernel <<< blocks, 1 >>>(volumePtr, 
                xSize, 
                ySize, 
                zSize); 
      printf("Kernel launch: %s\n", cudaGetErrorString(cudaGetLastError())); 
      cudaDeviceSynchronize(); 
      printf("Device synchronise: %s\n", cudaGetErrorString(cudaGetLastError())); 
     } 

    } 

}

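The function create3dVolumeOnGPU allocates a 3-dimensional volume in GPU memory and returns a pointer to it. It is a host function. The second host function is startReconstruction. The only thing it does is launch the actual kernel, using as many blocks as there are voxels in the volume. The kernel function is reconstructionKernel. It just computes an arbitrary value from some constants and then calls addToVolumeElement (a device function) to add the result to the corresponding voxel.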
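The addressing that addToVolumeElement performs can be checked on the host without a GPU. The following is a minimal sketch of the same slice/row/element arithmetic; the function name voxelByteOffset is hypothetical, and the pitch passed in is whatever cudaMalloc3D happened to return (its value is not given in the question).

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of voxel (x, y, z) from the start of a pitched allocation.
// Mirrors addToVolumeElement: z selects a slice, y selects a row, x an element.
// `pitch` is the row width in bytes including padding (hypothetical input here;
// on the device it comes from cudaPitchedPtr::pitch).
inline std::size_t voxelByteOffset(std::size_t pitch, std::size_t ySize,
                                   std::size_t x, std::size_t y, std::size_t z) {
    assert(pitch % sizeof(float) == 0);          // rows must stay float-aligned
    const std::size_t slicePitch = pitch * ySize; // bytes per z-slice
    return z * slicePitch + y * pitch + x * sizeof(float);
}
```

For example, with an assumed pitch of 2048 bytes and ySize = 500, voxel (3, 2, 1) sits 1 * 2048 * 500 + 2 * 2048 + 3 * 4 bytes into the allocation.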

Now, the problem is that it crashes. If I launch it with the debugger (NSight), NSight breaks with this error message:

CUDA grid launch failed: CUcontext: 2358451327088 CUmodule: 2358541519888 Function: _ZN2ct4cuda20reconstructionKernelE14cudaPitchedPtryyy

Console output:

malloc3D: no error 
memset: no error 
kernel started 
kernel end

If I run it in release mode, the whole machine resets.

However, if I make the volume dimensions smaller, it works, for example:

    size_t const xDimension = 100;
    size_t const yDimension = 100;
    size_t const zDimension = 100;

However, the amount of free GPU memory should not be the problem (the card has 4 GB of VRAM).
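As a rough check of that claim, the size of the allocation can be estimated on the host. This is only a sketch: the helper names are hypothetical, and the row alignment that cudaMalloc3D applies is device-dependent, so the 512-byte value used in the example below is an assumption, not a documented constant.

```cpp
#include <cassert>
#include <cstdint>

// Round a row width in bytes up to the next multiple of `alignment`.
// The alignment actually used by cudaMalloc3D is not known here (assumption).
inline std::uint64_t paddedPitch(std::uint64_t rowBytes, std::uint64_t alignment) {
    assert(alignment > 0);
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Estimated size in bytes of a float volume with padded rows.
inline std::uint64_t volumeBytes(std::uint64_t xSize, std::uint64_t ySize,
                                 std::uint64_t zSize, std::uint64_t alignment) {
    return paddedPitch(xSize * sizeof(float), alignment) * ySize * zSize;
}
```

With an assumed 512-byte alignment, the 500 x 500 x 1000 volume comes to 1,024,000,000 bytes, about a quarter of the card's 4 GB, which is consistent with cudaMalloc3D reporting no error.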

It would be nice if someone could take a look at it and maybe give me a tip about what might be causing the problem.


2016-02-13 user1488118

OK, it seems the problem is that I am only using blocks, with just 1 thread per block. But why? – user1488118

You may be running into a [WDDM TDR issue](http://http.developer.nvidia.com/NsightVisualStudio/2.2/Documentation/UserGuide/HTML/Content/Timeout_Detection_Recovery.htm). –

OK, I will have to look into that. It does look like my problem has been solved by using multiple threads per block. – user1488118

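"Now, the problem is that it crashes."

"It would be nice if someone could take a look at it and maybe give me a tip about what might be causing the problem."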

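My guess is that you are most likely hitting a WDDM TDR issue. On Windows, any time a kernel running on a WDDM GPU takes longer than about 2 seconds to execute, you may hit the WDDM TDR watchdog (assuming you have not made any changes to the watchdog).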

Furthermore, launching a kernel like this:

reconstructionKernel <<< blocks, 1 >>>(...);

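where the number of threads per block is 1, means that only one thread is active in each warp (and in each block). But the GPU likes to have 32 active threads per warp. So the net effect is inefficient use of GPU resources; when you run a kernel this way, perhaps as much as 97% of the GPU's horsepower sits idle.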
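The 97% figure follows from the warp size: with one thread per block, 31 of the 32 lanes in each warp do nothing. A minimal host-side sketch of that arithmetic (the function name is hypothetical, and this counts idle lanes only; it is not a full occupancy model):

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Fraction of warp lanes left idle for a given number of threads per block.
// A block always occupies whole warps, so a partial warp wastes its unused lanes.
inline double idleLaneFraction(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    const int warps = (threadsPerBlock + kWarpSize - 1) / kWarpSize;  // warps per block
    const int lanes = warps * kWarpSize;                              // lanes occupied
    return static_cast<double>(lanes - threadsPerBlock) / lanes;
}
```

idleLaneFraction(1) is 31/32 = 0.96875, roughly the 97% mentioned above, while any multiple of 32 such as 256 leaves no lanes idle.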

So, if your code is flexible enough to allow either this:

reconstructionKernel <<< blocks, 1 >>>(...);

or the equivalent:

reconstructionKernel <<< blocks/256, 256 >>>(...);

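(This is just a representative example. I understand that you have a multi-dimensional grid, so the above may not be exactly relevant to your case.)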

then the second launch method will almost certainly be more efficient, resulting in a shorter execution time for the same work.
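For the multi-dimensional grid in the question, the usual way to derive the block count is a ceiling division per axis. The sketch below is host-only C++ so that it can be checked without a GPU; Dim3 is a hypothetical stand-in for CUDA's dim3, and the 8 x 8 x 4 block shape (256 threads) is an arbitrary choice for illustration, not a tuned value.

```cpp
#include <cassert>
#include <cstddef>

struct Dim3 { unsigned x, y, z; };  // hypothetical stand-in for CUDA's dim3

// Smallest number of chunks of size d that cover n elements.
inline unsigned ceilDiv(std::size_t n, unsigned d) {
    assert(d > 0);
    return static_cast<unsigned>((n + d - 1) / d);
}

// Grid dimensions needed to cover the volume with blocks of the given shape.
inline Dim3 gridFor(std::size_t xSize, std::size_t ySize, std::size_t zSize, Dim3 block) {
    return { ceilDiv(xSize, block.x), ceilDiv(ySize, block.y), ceilDiv(zSize, block.z) };
}
```

For the 500 x 500 x 1000 volume this gives a 63 x 63 x 250 grid instead of 500 x 500 x 1000 blocks. The kernel would then compute each index as blockIdx.x * blockDim.x + threadIdx.x (and likewise for y and z); because 63 * 8 = 504 overshoots 500, the existing bounds check in reconstructionKernel is still needed.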

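So I believe that when you tested your code with multiple threads per block, you did something like the above, and that brought the execution time below the TDR limit.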

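This is a perfectly reasonable solution, but if you end up adding more work to your kernel (more total threads, or more work per thread), you may run into the limit again. In that case, the linked article explains possible workarounds.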

As an aside, kernel launch configurations like this:

kernel<<<1, ?>>>(...);

or this:

kernel<<<?, 1>>>(...);

are never recommended for high-performance GPU code.


2016-02-15 15:15:36

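I was going to add multiple threads anyway. I only started with one thread per block to simplify the implementation. I still find this behaviour strange. It basically means that any sufficiently complex kernel will eventually cause a crash. – user1488118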

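Correct, for a WDDM GPU under Windows, unless you change some system settings. The behaviour of the TDR watchdog is ultimately part of the Microsoft operating system and is enforced by the OS. NVIDIA offers various GPUs that can be placed in TCC mode to remove them from this limitation, but this is only available on certain GPUs, and such GPUs can no longer drive a display. –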
