CUDA - the same algorithm works on the CPU but not on the GPU

-1

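I am currently working on my first CUDA project and have run into something odd that must be inherent to CUDA and that I either don't understand or have overlooked. The same algorithm (exactly the same one, with no parallel work involved) works on the CPU but not on the GPU.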

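Let me explain in more detail. I am computing a threshold with Otsu's method, repeating the computation in every thread in exchange for less transfer time. Long story short, this function: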

__device__ double computeThreshold(unsigned int* histogram, int* nbPixels){ 
    double sum = 0; 
    for (int i = 0; i < 256; i++){ 
     sum += i*histogram[i]; 
    } 
    int sumB = 0, wB = 0, wF = 0; 
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0; 
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){ 
     wB += histogram[j]; 
     if (wB != 0) { 
      wF = *nbPixels - wB; 
      if (wF != 0){ 
       sumB += j*histogram[j]; 
       mB = sumB/wB; 
       mF = (sum - sumB)/wF; 
       between = wB * wF *(mB - mF) *(mB - mF); 
       if (max < 2.0){ 
        threshold1 = j; 
        if (between > max){ 
         threshold2 = j; 
        } 
        max = between; 
       } 
      } 
     } 
    } 

    return (threshold1 + threshold2)/2.0; 
}

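This works as expected when the image size (i.e. the number of pixels) is not too large, but fails otherwise. Interestingly, even if I don't use histogram and nbPixels inside the function and replace every occurrence of them with constants, it still fails, even if I remove the parameters from the function altogether. (By "fails" I mean that the first operation after the kernel call returns an unspecified launch failure.)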
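For checking the GPU result against a known-good value, a CPU reference can help. The following is a minimal sketch of the textbook form of Otsu's method, which picks the bin that maximizes the between-class variance. It is not the variant posted above (which averages two candidate thresholds), and the function name `otsuThreshold` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Textbook Otsu threshold over a 256-bin histogram (illustrative sketch).
// Returns the first bin t that maximizes wB * wF * (mB - mF)^2, where
// pixels with value <= t form the background class.
int otsuThreshold(const std::uint32_t* histogram) {
    assert(histogram != nullptr);

    std::uint64_t total = 0;   // total number of pixels
    double sumAll = 0.0;       // sum of value * count over all bins
    for (int i = 0; i < 256; ++i) {
        total += histogram[i];
        sumAll += static_cast<double>(i) * histogram[i];
    }

    std::uint64_t wB = 0;      // background weight (pixel count)
    double sumB = 0.0;         // background sum of value * count
    double best = -1.0;        // best between-class variance so far
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        wB += histogram[t];
        if (wB == 0) continue;             // no background pixels yet
        const std::uint64_t wF = total - wB;
        if (wF == 0) break;                // no foreground pixels left
        sumB += static_cast<double>(t) * histogram[t];
        const double mB = sumB / static_cast<double>(wB);            // background mean
        const double mF = (sumAll - sumB) / static_cast<double>(wF); // foreground mean
        const double between = static_cast<double>(wB) * static_cast<double>(wF)
                               * (mB - mF) * (mB - mF);
        if (between > best) {
            best = between;
            bestT = t;
        }
    }
    return bestT;
}
```

Because the result depends only on the histogram, every thread that calls a function like this with the same inputs gets the same answer, so a single host-side call is enough to produce a value to compare against.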

EDIT 3: OK, my earlier test code contained a copy/paste error. This version compiles and reproduces the error:

__device__ double computeThreshold(unsigned int* histogram, long int* nbPixels){ 
    double sum = 0; 
    for (int i = 0; i < 256; i++){ 
     sum += i*histogram[i]; 
    } 
    int sumB = 0, wB = 0, wF = 0; 
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0; 
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){ 
     wB += histogram[j]; 
     if (wB != 0) { 
      wF = *nbPixels - wB; 
      if (wF != 0){ 
       sumB += j*histogram[j]; 
       mB = sumB/wB; 
       mF = (sum - sumB)/wF; 
       between = wB * wF *(mB - mF) *(mB - mF); 
       if (max < 2.0){ 
        threshold1 = j; 
        if (between > max){ 
         threshold2 = j; 
        } 
        max = between; 
       } 
      } 
     } 
    } 

    return (threshold1 + threshold2)/2.0; 
} 

__global__ void imageKernel(unsigned int* image, unsigned int* histogram, long int* nbPixels, double* t_threshold){ 

    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x; 
    if (i >= *nbPixels) return; 
    double threshold = computeThreshold(histogram, nbPixels); 
    unsigned int pixel = image[i]; 
    if (pixel >= threshold){ 
     pixel = 255; 
    } else { 
     pixel = 0; 
    } 
    image[i] = pixel; 

    *t_threshold = threshold; 

} 


int main(){ 
unsigned int histogram[256] = { 0 }; 
const int width = 2048 * 4096; 
const int height = 1; 

unsigned int* myimage; 
myimage = new unsigned int[width*height]; 
for (int i = 0; i < width*height; i++){ 
    myimage[i] = i % 256; 
    histogram[i % 256]++; 
} 
const int threadPerBlock = 256; 
const int nbBlock = ceil((double)(width*height)/threadPerBlock); 
unsigned int* partial_histograms = new unsigned int[256 * nbBlock]; 

dim3 dimBlock(threadPerBlock, 1); 
dim3 dimGrid(nbBlock, 1); 
unsigned int* dev_image; 
unsigned int* dev_histogram; 
unsigned int* dev_partial_histograms; 
double* dev_threshold; 
double x = 0; 
double* threshold = &x; 
long int* nbPixels; 
long int nb = width*height; 
nbPixels = &(nb); 
long int* dev_nbPixels; 

cudaSetDevice(0); 
cudaMalloc((void**)&dev_image, sizeof(unsigned int)*width*height); 
cudaMalloc((void**)&dev_histogram, sizeof(unsigned int)* 256); 
cudaMalloc((void**)&dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock); 
cudaMalloc((void**)&dev_threshold, sizeof(double)); 
cudaMalloc((void**)&dev_nbPixels, sizeof(long int)); 
cudaMemcpy(dev_image, myimage, sizeof(unsigned int)*width*height, cudaMemcpyHostToDevice); 
cudaMemcpy(dev_histogram, histogram, sizeof(unsigned int)* 256, cudaMemcpyHostToDevice); 
cudaMemcpy(dev_nbPixels, nbPixels, sizeof(long int), cudaMemcpyHostToDevice); 





imageKernel<<<dimGrid, dimBlock>>>(dev_image, dev_histogram, dev_nbPixels, dev_threshold); 



cudaMemcpy(histogram, dev_histogram, sizeof(unsigned int)* 256, cudaMemcpyDeviceToHost); 
cudaMemcpy(partial_histograms, dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock, cudaMemcpyDeviceToHost); 
cudaMemcpy(threshold, dev_threshold, sizeof(double), cudaMemcpyDeviceToHost); 

cudaDeviceReset(); 


return 0; 
}

EDIT 4: Characteristics of my GPU

CUDA Device Query (Runtime API) version (CUDART static linking) 

Detected 1 CUDA Capable device(s) 

Device 0: "GeForce GT 750M" 
    CUDA Driver Version/Runtime Version   7.5/7.5 
    CUDA Capability Major/Minor version number: 3.0 
    Total amount of global memory:     2048 MBytes (2147483648 bytes) 
    (2) Multiprocessors, (192) CUDA Cores/MP:  384 CUDA Cores 
    GPU Max Clock rate:       1085 MHz (1.09 GHz) 
    Memory Clock rate:        900 Mhz 
    Memory Bus Width:        128-bit 
    L2 Cache Size:         262144 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 1 copy engine(s) 
    Run time limit on kernels:      Yes 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Disabled 
    CUDA Device Driver Mode (TCC or WDDM):   WDDM (Windows Display Driver Model) 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/1/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 750M 
Result = PASS

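EDIT 5: I ran cuda-memcheck again, and this time it did output an error message. I don't know why it didn't the first time; I must have made another mistake. I hope you will forgive the hesitation and the wasted time. Here is the output: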

========= CUDA-MEMCHECK 
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy. 
=========  Saved host backtrace up to driver entry point at error 
=========  Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc764] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24] 
=========  Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2] 
=========  Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454] 
========= 
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy. 
=========  Saved host backtrace up to driver entry point at error 
=========  Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc788] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24] 
=========  Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2] 
=========  Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454] 
========= 
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy. 
=========  Saved host backtrace up to driver entry point at error 
=========  Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc7a6] 
=========  Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24] 
=========  Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2] 
=========  Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454] 
========= 
========= ERROR SUMMARY: 3 errors

Not very telling, though, is it?

Source

2015-12-14 Nico

You need to provide the host code. – brano

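For this kind of debugging question, we can't help you unless you provide the shortest complete code that someone else can copy and paste into an editor, compile, and run, and that reproduces your error. CUDA ships with tools for detecting memory access errors, such as cuda-memcheck. Have you tried using them? – talonmies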

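@talonmies I know it is hard, if not impossible, to spot the error this way, but I thought there might be some relatively basic principle I had overlooked. I did try cuda-memcheck, and it found no errors. – Nico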

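OK, it turns out this was not an error on my side: Windows decided that 2 s was enough and that it needed to reset the GPU, which stopped my computation right there. Many thanks to @RobertCrovella, without whom I would never have found this out. And thanks to everyone who tried to answer.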
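The 2 s limit is consistent with the device query above, which reports WDDM driver mode and "Run time limit on kernels: Yes"; under WDDM, Windows resets a GPU whose kernel runs past its watchdog timeout (2 seconds by default). A rough way to see why this kernel gets that slow as the image grows is to count loop iterations. The sketch below assumes the test histogram from the question, where every bin is non-empty, so each thread runs both 256-step loops in full; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Loop iterations one thread spends in computeThreshold for the test
// histogram: 256 in the first loop plus 256 in the second (assumption:
// the second loop does not exit early, which holds when no bin is empty).
const std::uint64_t kIterationsPerThread = 256 + 256;

// Total iterations when every thread (one per pixel) repeats that work.
std::uint64_t totalIterations(std::uint64_t pixels) {
    assert(pixels > 0);
    return pixels * kIterationsPerThread;
}

// The same quantity if the threshold were computed once and shared.
std::uint64_t sharedIterations() {
    return kIterationsPerThread;
}
```

For the 2048 * 4096 pixel test image, `totalIterations` gives 4,294,967,296 iterations of double-precision work in a single launch, against 512 if the threshold were computed once. The per-thread recomputation, rather than any memory error, is what makes the run time scale with image size.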

Source

2015-12-14 16:42:43 Nico

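So, having been given a compilable example (was that really so hard?), I cannot reproduce any error with this code (64-bit Linux, compute capability 3.0 device, CUDA 7.0 release version):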

$ nvcc -arch=sm_30 -Xptxas="-v" histogram.cu 
ptxas info : 0 bytes gmem 
ptxas info : Compiling entry function '_Z11imageKernelPjS_PlPd' for 'sm_30' 
ptxas info : Function properties for _Z11imageKernelPjS_PlPd 
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
ptxas info : Used 34 registers, 352 bytes cmem[0], 16 bytes cmem[2] 

$ for i in `seq 1 20`; 
> do 
>  cuda-memcheck ./a.out 
> done 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors 
========= CUDA-MEMCHECK 
========= ERROR SUMMARY: 0 errors

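So if you can reproduce a runtime error where I cannot, then your environment, hardware, or toolkit version differs subtly from mine in some way. In any case, the code itself works, and you have a platform-specific problem that I can't reproduce.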

Source

2015-12-14 15:58:24 talonmies

It turns out I *did* have a cuda-memcheck error; for some reason it did not show up the first time. – Nico
