2012-09-06 81 views
1

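CUBLAS library not giving correct results

I want to explore the CUBLAS library, so I wrote matrix multiplication code using its API. But I am getting strange output. I am pasting the code and the output below. Please help me.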

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>

// Thread block size 
#define BLOCK_SIZE 3 

#define WA 3 // Matrix A width 
#define HA 3 // Matrix A height 
#define WB 3 // Matrix B width 
#define HB WA // Matrix B height 
#define WC WB // Matrix C width 
#define HC HA // Matrix C height 
// Fills a matrix with the sequential values 0, 1, 2, ... (not random, despite the name).
void randomInit(float* data, int size)
{
    for (int i = 0; i < size; ++i)
        data[i] = i;
}
///////////////////////////////////////////////////////// 
// Program main 
///////////////////////////////////////////////////////// 

int main(int argc, char** argv) 
{ 

    // 1. allocate host memory for matrices A and B 
    unsigned int size_A = WA * HA; 
    unsigned int mem_size_A = sizeof(float) * size_A; 
    float* h_A = (float*) malloc(mem_size_A); 

    unsigned int size_B = WB * HB; 
    unsigned int mem_size_B = sizeof(float) * size_B; 
    float* h_B = (float*) malloc(mem_size_B); 
    cublasStatus_t status; 
    // 2. initialize host memory 
    randomInit(h_A, size_A); 
    randomInit(h_B, size_B); 

    // 3. print out A and B 
    printf("\n\nMatrix A\n"); 
    for(int i = 0; i < size_A; i++) 
    { 
     printf("%f ", h_A[i]); 
     if(((i + 1) % WA) == 0) 
      printf("\n"); 
    } 

    printf("\n\nMatrix B\n"); 
    for(int i = 0; i < size_B; i++)
    {
        printf("%f ", h_B[i]);
        if(((i + 1) % WB) == 0)
            printf("\n");
    }

    // 8. allocate device memory
    float* d_A;
    float* d_B;
    cudaMalloc((void**) &d_A, mem_size_A);
    cudaMalloc((void**) &d_B, mem_size_B);

    // 9. copy host memory to device
    status = cublasSetMatrix(BLOCK_SIZE, BLOCK_SIZE, sizeof(float), h_A, BLOCK_SIZE, d_A, BLOCK_SIZE);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    status = cublasSetMatrix(BLOCK_SIZE, BLOCK_SIZE, sizeof(float), h_B, BLOCK_SIZE, d_B, BLOCK_SIZE);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    // 4. allocate host memory for the result C
    unsigned int size_C = WC * HC;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float* h_C = (float*) malloc(mem_size_C);

    // 10. allocate device memory for the result
    float* d_C;
    cudaMalloc((void**) &d_C, mem_size_C);

    // 5. perform the calculation
    cublasSgemm('N', 'N', BLOCK_SIZE, BLOCK_SIZE, BLOCK_SIZE, 1.0f, d_A, BLOCK_SIZE, d_B, BLOCK_SIZE, 1.0f, d_C, BLOCK_SIZE);
    status = cublasGetError();
    if (status) {
        fprintf (stderr, "!!!! kernel execution error.\n");
        return EXIT_FAILURE;
    }

    // 11. copy result from device to host
    status = cublasGetMatrix(BLOCK_SIZE, BLOCK_SIZE, sizeof(float), d_C, BLOCK_SIZE, h_C, BLOCK_SIZE);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! device access error (read C)\n");
        return EXIT_FAILURE;
    }

    // 6. print out the results
    printf("\n\nMatrix C (Results)\n");
    for(int i = 0; i < size_C; i++)
    {
        printf("%f ", h_C[i]);
        if(((i + 1) % WC) == 0)
            printf("\n");
    }
    printf("\n");

    // 7. clean up memory
    free(h_A);
    free(h_B);
    free(h_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

}

---------Output-------------

Matrix A
0.000000 1.000000 2.000000
3.000000 4.000000 5.000000
6.000000 7.000000 8.000000

Matrix B
0.000000 1.000000 2.000000
3.000000 4.000000 5.000000
6.000000 7.000000 8.000000

Matrix C (Results)
-1998397155538108416.000000 -1998397155538108416.000000 -1998397155538108416.000000
-1998397155538108416.000000 -1998397155538108416.000000 -1998397155538108416.000000
-1998397155538108416.000000 -1998397155538108416.000000 -1998397155538108416.000000

Answer

3

Your problem is that you are using uninitialised memory in the SGEMM call. cublasSgemm(), like all BLAS GEMM operations, computes

C = alpha * op(A) * op(B) + beta * C 
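To make the role of beta concrete, here is a minimal CPU sketch of that formula. It assumes column-major storage (the layout CUBLAS uses) and no transposes, so op(A) = A and op(B) = B. The name sgemm_ref is a hypothetical helper for illustration and is not a CUBLAS function.

```c
#include <stddef.h>

/* Reference CPU version of C = alpha * A * B + beta * C for column-major
 * matrices without transposes. A is m x k, B is k x n, C is m x n.
 * Element (i, j) of a matrix with leading dimension ld is at i + j * ld.
 * sgemm_ref is a hypothetical name, not part of CUBLAS. */
void sgemm_ref(int m, int n, int k, float alpha,
               const float *A, int lda,
               const float *B, int ldb,
               float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];

            if (beta == 0.0f) {
                /* BLAS convention: with beta == 0 the old C is not read,
                 * so it does not need to be initialised. */
                C[i + j * ldc] = alpha * acc;
            } else {
                /* Otherwise whatever was already in C enters the result. */
                C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
            }
        }
    }
}
```

Calling it on the 3x3 inputs from the question with every entry of C preset to 1000 gives 1015 in the first element when beta is 1 and 15 when beta is 0, so any stale contents of C pass straight into the output unless beta is zero.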

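In your code you are passing op(A) = A, op(B) = B, alpha = 1. and beta = 1. But you never set the values of C to anything. The memory on the GPU is uninitialised and can contain random values, which gives the corrupted result you are seeing. Change your function call to this: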

cublasSgemm('N','N',BLOCK_SIZE,BLOCK_SIZE,BLOCK_SIZE,1.0f,d_A, 
      BLOCK_SIZE,d_B,BLOCK_SIZE,0.f,d_C,BLOCK_SIZE); 

which computes

C = 1.0 * A * B + 0. * C 

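and you should get more sensible output. Once you have it producing output, you will go on to discover that CUBLAS assumes matrices are stored in column-major order, so the correct output from your printing code should be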

15 18 21 
42 54 66 
69 90 111 
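The printout above lists the result buffer in memory order, so each printed line is one column of the mathematical matrix. As a rough sketch of how to read such a buffer row by row instead, the helpers below index it as column-major. The names cm_at and print_column_major are hypothetical and are not CUBLAS functions.

```c
#include <stdio.h>

/* Element (row, col) of a column-major matrix with leading dimension ld.
 * cm_at is a hypothetical helper name. */
float cm_at(const float *m, int ld, int row, int col)
{
    return m[row + col * ld];
}

/* Print a column-major matrix one mathematical row per line. */
void print_column_major(const float *m, int rows, int cols, int ld)
{
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c)
            printf("%f ", cm_at(m, ld, r, c));
        printf("\n");
    }
}
```

For the buffer {15, 18, 21, 42, 54, 66, 69, 90, 111}, print_column_major(buf, 3, 3, 3) prints the rows 15 42 69, then 18 54 90, then 21 66 111, which is the product when both inputs are read as column-major.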
+0

Thank you very much... it worked :) – user1439690

+1

@user1439690: If this answer solved your problem, perhaps you would be so kind as to [accept it](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – talonmies
