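CUDA - Same algorithm works on CPU but not on GPU

I am currently working on my first project in CUDA and have run into something odd that must be inherent to CUDA and that I either don't understand or have overlooked. The same algorithm - exactly the same one, involving no parallel work - works on the CPU but not on the GPU.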
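Let me explain in more detail. I compute the threshold with Otsu's method; the computation is duplicated, but this cuts transfer time. Long story short, this function: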
__device__ double computeThreshold(unsigned int* histogram, int* nbPixels){
    // Weighted sum of all grey levels
    double sum = 0;
    for (int i = 0; i < 256; i++){
        sum += i*histogram[i];
    }
    int sumB = 0, wB = 0, wF = 0;
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
        wB += histogram[j];             // background weight
        if (wB != 0) {
            wF = *nbPixels - wB;        // foreground weight
            if (wF != 0){
                sumB += j*histogram[j];
                mB = sumB/wB;
                mF = (sum - sumB)/wF;
                between = wB * wF *(mB - mF) *(mB - mF);
                if (max < 2.0){
                    threshold1 = j;
                    if (between > max){
                        threshold2 = j;
                    }
                    max = between;
                }
            }
        }
    }
    return (threshold1 + threshold2)/2.0;
}
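fails on the GPU. Even if I replace every use of its parameters with constants, it still fails - even if I remove the parameters from the function. (By "fails" I mean that the first operation after the call to the kernel returns an unspecified launch failure.)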
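For comparison, the same computation can be run on the host alone. The following is only a minimal sketch under stated assumptions, not the code from the post: the names `makeSyntheticHistogram` and `otsuThresholdHost` are hypothetical, the counters are widened to `long long` and the means computed in `double`, and the update rule is the textbook `between >= max` form rather than the `max < 2.0` test used in the function above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: histogram of a synthetic image whose pixel i has value i % 256.
std::vector<unsigned int> makeSyntheticHistogram(long long nbPixels) {
    std::vector<unsigned int> histogram(256, 0);
    for (long long i = 0; i < nbPixels; i++) histogram[i % 256]++;
    return histogram;
}

// Hypothetical host-side Otsu threshold (textbook update rule, an assumption).
double otsuThresholdHost(const std::vector<unsigned int>& histogram, long long nbPixels) {
    assert(histogram.size() == 256);
    double sum = 0;
    for (int i = 0; i < 256; i++) sum += static_cast<double>(i) * histogram[i];

    long long wB = 0, wF = 0;  // background / foreground pixel counts
    double sumB = 0, max = 0, threshold1 = 0, threshold2 = 0;
    for (int j = 0; j < 256; j++) {
        wB += histogram[j];
        if (wB == 0) continue;  // no background pixels yet
        wF = nbPixels - wB;
        if (wF == 0) break;     // no foreground pixels left
        sumB += static_cast<double>(j) * histogram[j];
        double mB = sumB / wB;           // background mean
        double mF = (sum - sumB) / wF;   // foreground mean
        // Between-class variance, evaluated in double to avoid integer overflow.
        double between = static_cast<double>(wB) * static_cast<double>(wF)
                       * (mB - mF) * (mB - mF);
        if (between >= max) {
            threshold1 = j;
            if (between > max) threshold2 = j;
            max = between;
        }
    }
    return (threshold1 + threshold2) / 2.0;
}
```

Calling `otsuThresholdHost(makeSyntheticHistogram(n), n)` with the pixel count of the test image gives a reference value to compare with whatever the kernel writes back, once the kernel runs to completion.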
__device__ double computeThreshold(unsigned int* histogram, long int* nbPixels){
    double sum = 0;
    for (int i = 0; i < 256; i++){
        sum += i*histogram[i];
    }
    int sumB = 0, wB = 0, wF = 0;
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
        wB += histogram[j];
        if (wB != 0) {
            wF = *nbPixels - wB;
            if (wF != 0){
                sumB += j*histogram[j];
                mB = sumB/wB;
                mF = (sum - sumB)/wF;
                between = wB * wF *(mB - mF) *(mB - mF);
                if (max < 2.0){
                    threshold1 = j;
                    if (between > max){
                        threshold2 = j;
                    }
                    max = between;
                }
            }
        }
    }
    return (threshold1 + threshold2)/2.0;
}
__global__ void imageKernel(unsigned int* image, unsigned int* histogram, long int* nbPixels, double* t_threshold){
    // One thread per pixel
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= *nbPixels) return;
    // Every thread recomputes the same threshold from the full histogram
    double threshold = computeThreshold(histogram, nbPixels);
    unsigned int pixel = image[i];
    if (pixel >= threshold){
        pixel = 255;
    } else {
        pixel = 0;
    }
    image[i] = pixel;
    *t_threshold = threshold;
}
int main(){
    unsigned int histogram[256] = { 0 };
    const int width = 2048 * 4096;
    const int height = 1;
    unsigned int* myimage;
    myimage = new unsigned int[width*height];
    // Synthetic image: pixel i has value i % 256
    for (int i = 0; i < width*height; i++){
        myimage[i] = i % 256;
        histogram[i % 256]++;
    }
    const int threadPerBlock = 256;
    const int nbBlock = ceil((double)(width*height)/threadPerBlock);
    unsigned int* partial_histograms = new unsigned int[256 * nbBlock];
    dim3 dimBlock(threadPerBlock, 1);
    dim3 dimGrid(nbBlock, 1);
    unsigned int* dev_image;
    unsigned int* dev_histogram;
    unsigned int* dev_partial_histograms;
    double* dev_threshold;
    double x = 0;
    double* threshold = &x;
    long int* nbPixels;
    long int nb = width*height;
    nbPixels = &(nb);
    long int* dev_nbPixels;
    // Device allocations
    cudaMalloc((void**)&dev_image, sizeof(unsigned int)*width*height);
    cudaMalloc((void**)&dev_histogram, sizeof(unsigned int)* 256);
    cudaMalloc((void**)&dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock);
    cudaMalloc((void**)&dev_threshold, sizeof(double));
    cudaMalloc((void**)&dev_nbPixels, sizeof(long int));
    // Host-to-device copies
    cudaMemcpy(dev_image, myimage, sizeof(unsigned int)*width*height, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_histogram, histogram, sizeof(unsigned int)* 256, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_nbPixels, nbPixels, sizeof(long int), cudaMemcpyHostToDevice);
    imageKernel<<<dimGrid, dimBlock>>>(dev_image, dev_histogram, dev_nbPixels, dev_threshold);
    // Device-to-host copies (the first of these is where the failure is reported)
    cudaMemcpy(histogram, dev_histogram, sizeof(unsigned int)* 256, cudaMemcpyDeviceToHost);
    cudaMemcpy(partial_histograms, dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock, cudaMemcpyDeviceToHost);
    cudaMemcpy(threshold, dev_threshold, sizeof(double), cudaMemcpyDeviceToHost);
    return 0;
}
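As a rough sanity check on the launch configuration in `main`, the sketch below redoes the block-count arithmetic with integer ceiling division and counts an upper bound on the loop iterations the whole grid performs, given that every thread calls `computeThreshold` (two loops of at most 256 steps each). The helper names `blocksFor` and `totalThresholdIterations` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: integer ceiling division, equivalent to
// ceil((double)n / threadsPerBlock) for positive inputs.
std::int64_t blocksFor(std::int64_t n, std::int64_t threadsPerBlock) {
    assert(n > 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: upper bound on loop iterations across the grid when
// each of the n active threads runs both 256-step loops of computeThreshold.
std::int64_t totalThresholdIterations(std::int64_t n) {
    assert(n > 0);
    return n * 2 * 256;
}
```

For the 2048 * 4096 pixels and 256 threads per block used here, `blocksFor` gives 32768 blocks and `totalThresholdIterations` gives 4294967296 (2^32) iterations as an upper bound.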
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 750M"
CUDA Driver Version/Runtime Version 7.5/7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
(2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1085 MHz (1.09 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536),
3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Mo
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID/Bus ID/location ID: 0/1/0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 750M
Result = PASS
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc764]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
========= Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc788]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
========= Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc7a6]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
========= Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
========= ERROR SUMMARY: 3 errors
You need to provide the host code. – brano
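For debugging questions like this, we can't help you unless you provide the shortest complete code that someone else can copy and paste into an editor, compile, run, and use to reproduce your error. CUDA ships with tools for detecting memory access errors, such as cuda-memcheck. Have you tried using them? – talonmies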
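@talonmies I know it is hard, if not impossible, to spot the error like this, but I thought there might be a relatively basic principle that I had overlooked. I did try cuda-memcheck, yes, and it found no errors. – Nico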