2012-11-09

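Matrix transpose using CUDA

I need to transpose a square matrix. I test the program with the matrix a[i][j] = 0 if i>j, a[i][j] = if i<=j, but the results show that not all elements end up in the right place.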

The code is below (everything except main()):

#include <stdio.h> 
#include <stdlib.h> 
__global__ void transpose_kernel (float *a, float *b, int n) { 
    unsigned int ax = blockDim.x * blockIdx.x + threadIdx.x; 
    unsigned int ay = blockDim.y * blockIdx.y + threadIdx.y; 
    unsigned int aIdx = ax + n * ay; 
    unsigned int bIdx = ay + n * ax; 

    b[bIdx] = a[aIdx]; 
} 

int transpose_host (float *a, float *b, int n) { 
    int size = n * n * sizeof (float); 
    float *aDev = NULL, *bDev = NULL; 

    cudaError_t cuerr = cudaMalloc ((void**)&aDev, size);
    if (cuerr != cudaSuccess) {
        fprintf (stderr, "Cannot allocate GPU memory for aDev: %s\n", cudaGetErrorString (cuerr));
        return (-1);
    }

    cuerr = cudaMalloc ((void**)&bDev, size);
    if (cuerr != cudaSuccess) {
        fprintf (stderr, "Cannot allocate GPU memory for bDev: %s\n", cudaGetErrorString (cuerr));
        return (-1);
    }

    dim3 blockSize = dim3 (16, 16, 1);
    dim3 gridSize = dim3 (n/16 + 1, n/16 + 1, 1);

    cuerr = cudaMemcpy (aDev, a, size, cudaMemcpyHostToDevice);
    if (cuerr != cudaSuccess) {
        fprintf (stderr, "Cannot copy data from a to aDev: %s\n", cudaGetErrorString (cuerr));
        return (-1);
    }

    transpose_kernel <<< gridSize, blockSize >>> (aDev, bDev, n);

    cuerr = cudaGetLastError();
    if (cuerr != cudaSuccess) {
        fprintf (stderr, "Cannot launch CUDA kernel: %s\n", cudaGetErrorString (cuerr));
        return (-1);
    }

    cuerr = cudaDeviceSynchronize();
    if (cuerr != cudaSuccess) {
        fprintf (stderr, "Cannot synchronize CUDA kernel: %s\n", cudaGetErrorString (cuerr));
        return (-1);
    }

    cuerr = cudaMemcpy (b, bDev, size, cudaMemcpyDeviceToHost);
    if (cuerr != cudaSuccess) {
        fprintf (stderr, "Cannot copy data from b to bDev: %s\n", cudaGetErrorString (cuerr));
        return (-1);
    }

    cudaFree (aDev);
    cudaFree (bDev);

    return (0); 
} 

Why isn't my array transposed correctly?

Answer


The problem is that the "extra" threads you launch go outside the array.

When you size your grid of blocks, you round up (in fact, you force rounding up to the next integer even when n divides evenly):

dim3 gridSize = dim3 (n/16 + 1, n/16 + 1, 1); 

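This way, there are always threads whose ax or ay falls outside [0, n). So when those threads copy a[aIdx] to b[bIdx], they copy random data around memory, and depending on scheduling they may actually overwrite the "real" data.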

You can fix this by adding a bounds check to your kernel:

if (ax < n && ay < n) 
    b[bIdx] = a[aIdx]; 

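You may also want to change your grid size so that it still rounds up, but does not add an extra block when n divides evenly: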

dim3 gridSize = dim3 ((n+15)/16, (n+15)/16, 1); 

Thanks, now it works! – Max