2017-04-23

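CUDA 2D array

I want to allocate a matrix on the device, fill it with some numbers in a kernel, and then copy it back to the host. The problem is that on the host only one row seems to have been filled.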

I get something like this:

9 9 9 9 
-1 -1 -1 -1 
-1 -1 -1 -1 
-1 -1 -1 -1 

Here is my code:

#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 
#include <stdio.h> 
#include <stdlib.h> 

void check(cudaError x) { 
    fprintf(stderr, "%s\n", cudaGetErrorString(x)); 
} 

void showMatrix2(int* v1, int width, int height) { 
    printf("---------------------\n"); 
    for (int i = 0; i < width; i++) { 
     for (int j = 0; j < height; j++) { 
      printf("%d ", v1[i * width + j]); 
     } 
     printf("\n"); 
    } 
} 

__global__ void kernel(int* tab,int width, int height, int pitch) { 

    int row = threadIdx.x + blockIdx.x * blockDim.x; 
    int col = threadIdx.y + blockIdx.y * blockDim.y; 

    if (row < width && col < height) { 
     tab[col * pitch + row] = 9; 
    } 
} 

int main() 
{ 
    int width = 4; 
    int height = 4; 

    int* d_tab; 
    int* h_tab; 

    int realSize = width * height* sizeof(int); 

    size_t pitch; 
    check(cudaMallocPitch(&d_tab, &pitch, width * sizeof(int), height)); 
    h_tab = (int*)malloc(realSize); 
    check(cudaMemset(d_tab, 0, realSize)); 

    dim3 grid(4, 4); 
    dim3 block(4, 4); 
    kernel <<<grid, block>>>(d_tab, width, height, pitch); 

    check(cudaMemcpy2D(h_tab, width*sizeof(int), d_tab, pitch, width*sizeof(int), height, cudaMemcpyDeviceToHost)); 

    showMatrix2(h_tab, width, height); 
    printf("\nPitch size: %d \n", pitch); 
    getchar(); 
    return 0; 
} 

Answers

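  1. Any time you are having trouble with a CUDA code, in addition to doing error checking, run your code with cuda-memcheck. If you had done that, you would have gotten at least a hint as to what is going on, and you could then use a technique like this to continue your own debugging. Even if you can't figure it out, the cuda-memcheck output will be useful to others trying to help you.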

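  2. You have invalid writes in the kernel. There are multiple errors here. To access a pitched allocation correctly in kernel code, I strongly recommend studying the example given in the documentation for cudaMallocPitch. In a nutshell, this sort of index generation is just broken: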

    tab[col * pitch + row] 
    

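    First, the pitch returned by cudaMallocPitch is a width in bytes. You cannot use it as an adjustment to an index for quantities like int or float (study the documentation). Second, the pitch value should ultimately be multiplied by the row index, not the column index.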

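  3. Unrelated to your problem, but the format specifier in your final printf statement is incorrect if you are on a 64-bit platform; it should be %ld (or better, %lu).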
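The byte-based pitch arithmetic from point 2 can be tried without a GPU. The following is only a minimal sketch: `pitched_elem`, `fill_and_pack` and the `HOST_DEVICE` macro are hypothetical names, not part of the CUDA API, and the padded buffer here is an ordinary host vector standing in for memory returned by cudaMallocPitch. It assumes the pitch is a multiple of `sizeof(int)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Under nvcc, make the helper callable from kernels; as plain C++ the macro is empty.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not part of the CUDA API): pointer to element (row, col)
// of a pitched int matrix. pitch_bytes is the row stride in BYTES.
HOST_DEVICE inline int* pitched_elem(int* base, std::size_t pitch_bytes, int row, int col) {
    // Step to the start of the row in bytes, then index the column in ints.
    char* row_start = reinterpret_cast<char*>(base) + static_cast<std::size_t>(row) * pitch_bytes;
    return reinterpret_cast<int*>(row_start) + col;
}

// Host-only stand-in for the kernel plus cudaMemcpy2D: write `value` into every
// element of a padded buffer, then pack the rows into a dense width*height array.
std::vector<int> fill_and_pack(int width, int height, std::size_t pitch_bytes, int value) {
    assert(pitch_bytes >= static_cast<std::size_t>(width) * sizeof(int));
    assert(pitch_bytes % sizeof(int) == 0);  // every row start stays int-aligned
    std::vector<int> padded(pitch_bytes / sizeof(int) * height, -1);
    std::vector<int> dense(static_cast<std::size_t>(width) * height, -1);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            *pitched_elem(padded.data(), pitch_bytes, row, col) = value;
            dense[static_cast<std::size_t>(row) * width + col] =
                *pitched_elem(padded.data(), pitch_bytes, row, col);
        }
    }
    return dense;
}
```

Inside a kernel, the write would then read `*pitched_elem(tab, pitch, row, col) = 9;`, with `pitch` passed as `size_t` so the byte count is not narrowed to `int`.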

Here is the code with the indexing problem fixed; it seems to work correctly for me:

$ cat t109.cu 
#include "cuda_runtime.h" 
#include "device_launch_parameters.h" 
#include <stdio.h> 
#include <stdlib.h> 

void check(cudaError x) { 
    fprintf(stderr, "%s\n", cudaGetErrorString(x)); 
} 

void showMatrix2(int* v1, int width, int height) { 
    printf("---------------------\n"); 
    for (int i = 0; i < width; i++) { 
     for (int j = 0; j < height; j++) { 
      printf("%d ", v1[i * width + j]); 
     } 
     printf("\n"); 
    } 
} 

__global__ void kernel(int* tab,int width, int height, int pitch) { 

    int row = threadIdx.x + blockIdx.x * blockDim.x; 
    int col = threadIdx.y + blockIdx.y * blockDim.y; 

    if (row < width && col < height) { 
     *(((int *)(((char *)tab) + (row * pitch))) + col) = 9; 
    } 
} 

int main() 
{ 
    int width = 4; 
    int height = 4; 

    int* d_tab; 
    int* h_tab; 

    int realSize = width * height* sizeof(int); 

    size_t pitch; 
    check(cudaMallocPitch(&d_tab, &pitch, width * sizeof(int), height)); 
    h_tab = (int*)malloc(realSize); 
    check(cudaMemset(d_tab, 0, realSize)); 

    dim3 grid(4, 4); 
    dim3 block(4, 4); 
    kernel <<<grid, block>>>(d_tab, width, height, pitch); 

    check(cudaMemcpy2D(h_tab, width*sizeof(int), d_tab, pitch, width*sizeof(int), height, cudaMemcpyDeviceToHost)); 

    showMatrix2(h_tab, width, height); 
    printf("\nPitch size: %ld \n", pitch); 
    return 0; 
} 
$ nvcc -arch=sm_61 -o t109 t109.cu 
$ cuda-memcheck ./t109 
========= CUDA-MEMCHECK 
no error 
no error 
no error 
--------------------- 
9 9 9 9 
9 9 9 9 
9 9 9 9 
9 9 9 9 

Pitch size: 512 
========= ERROR SUMMARY: 0 errors 
$ 
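A note on printing the pitch: `%ld` matches `long`, which is 64 bits wide on 64-bit Linux but only 32 bits on 64-bit Windows. Since `pitch` is a `size_t`, the `%zu` specifier from C99/C++11 fits it on both. A minimal sketch, where `pitch_label` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format a pitch value using %zu, the specifier for size_t.
std::string pitch_label(std::size_t pitch_bytes) {
    char buf[64];
    int written = std::snprintf(buf, sizeof buf, "Pitch size: %zu", pitch_bytes);
    // snprintf returns the length it needed; make sure the text fit in buf.
    assert(written > 0 && written < static_cast<int>(sizeof buf));
    return std::string(buf);
}
```

In the program above, the equivalent direct call would be `printf("\nPitch size: %zu \n", pitch);`.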

Thank you very much for the help :) – Knight