Matrix multiplication using CUDA

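I wrote a matrix multiplication in CUDA. The resulting product matrix is always zero. I have read some sample code, such as "matrix multiplication in cuda", to solve my problem, but all in vain.

Apart from the erratic result of 0, the largest "Width" (see the code below) that works is not even 512. I was not able to debug where the problem lies. Maybe we can discuss it on StackOverflow.

I am referring to "Programming Massively Parallel Processors".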

#include<cuda.h> 
#include<stdio.h> 

int main(void) { 
    void MatrixMultiplication(float *, float *, float *, int); 
    const int Width = 5; 
    float M[Width*Width], N[Width*Width], P[Width*Width]; 
    for(int i = 0; i < (Width*Width) ; i++) {
        M[i] = 5;
        N[i] = 5;
        P[i] = 0;
    }
    MatrixMultiplication(M, N, P, Width); 
    for(int i = 0; i < (Width*Width) ; i++) {
        printf("%d \n", P[i]);
    }
    int quit; 
    scanf("%d",&quit); 
    return 0; 
} 

//Matrix multiplication kernel - thread specification 
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width) { 
    //2D Thread ID 
    int tx = threadIdx.x; 
    int ty = threadIdx.y; 

    //Pvalue stores the Pd element that is computed by the thread 
    float Pvalue = 0; 

    for(int k = 0; k < Width ; ++k) {
        float Mdelement = Md[ty*Width + k];
        float Ndelement = Nd[k*Width + tx];
        Pvalue += (Mdelement*Ndelement);
    }

    Pd[ty*Width + tx] = Pvalue; 
} 

void MatrixMultiplication(float *M, float *N, float *P, int Width) { 
    int size = Width*Width*sizeof(float); 
    float *Md, *Nd, *Pd; 

    //Transfer M and N to device memory 
    cudaMalloc((void**)&Md, size); 
    cudaMemcpy(Md,M,size,cudaMemcpyHostToDevice); 
    cudaMalloc((void**)&Nd, size); 
    cudaMemcpy(Nd,N,size,cudaMemcpyHostToDevice); 

    //Allocate P on the device 
    cudaMalloc((void**)&Pd,size); 

    //Setup the execution configuration 
    dim3 dimBlock(Width,Width); 
    dim3 dimGrid(1,1); 

    //Launch the device computation threads! 
    MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width); 

    //Transfer P from device to host 
    cudaMemcpy(P,Pd,size,cudaMemcpyDeviceToHost); 

    //Free device matrices 
    cudaFree(Md); 
    cudaFree(Nd); 
    cudaFree(Pd); 
}


2011-02-16 Gaurav Kalra

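To get proper code formatting, you need to indent all code by 4 spaces. You can do this easily by highlighting your code and pressing "Ctrl + K".

Thanks Jeff! I was just about to do that.

If you don't need to stick to your own code, the CUDA C Programming Guide has a very nice matrix multiplication implementation that handles matrices whose dimensions are not powers of 2 and is optimized using shared memory. Highly recommended, both for real-world use and for learning.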
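For readers who want to see what a shared-memory version looks like, here is a minimal sketch. It is not the code from the CUDA C Programming Guide: the names MatrixMulShared, multiplyShared, and TILE are made up for illustration, and a 16x16 tile is assumed to fit the device's per-block thread limit. Loads that fall outside the matrix are padded with zeros, which is one way to support a Width that is not a multiple of the tile size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile edge; 16*16 = 256 threads per block

// Each block computes one TILE x TILE tile of Pd, staging the matching tiles
// of Md and Nd in shared memory. Out-of-range loads are replaced by zeros.
__global__ void MatrixMulShared(const float *Md, const float *Nd, float *Pd, int Width) {
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float Pvalue = 0.0f;

    int numTiles = (Width + TILE - 1) / TILE;
    for (int t = 0; t < numTiles; ++t) {
        int mCol = t * TILE + threadIdx.x;  // column of Md loaded by this thread
        int nRow = t * TILE + threadIdx.y;  // row of Nd loaded by this thread
        Ms[threadIdx.y][threadIdx.x] =
            (row < Width && mCol < Width) ? Md[row * Width + mCol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] =
            (nRow < Width && col < Width) ? Nd[nRow * Width + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            Pvalue += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }

    if (row < Width && col < Width)
        Pd[row * Width + col] = Pvalue;
}

// Host wrapper (hypothetical name): copies the inputs, launches the kernel,
// and copies P back. Returns the launch error if there is one, otherwise
// the status of the final copy.
cudaError_t multiplyShared(const float *M, const float *N, float *P, int Width) {
    size_t size = (size_t)Width * Width * sizeof(float);
    float *Md = NULL, *Nd = NULL, *Pd = NULL;
    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE, TILE);
    dim3 dimGrid((Width + TILE - 1) / TILE, (Width + TILE - 1) / TILE);
    MatrixMulShared<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return err;
}

int main(void) {
    const int Width = 23;  // deliberately not a multiple of TILE
    std::vector<float> M(Width * Width, 5.0f), N(Width * Width, 5.0f), P(Width * Width, 0.0f);
    cudaError_t err = multiplyShared(M.data(), N.data(), P.data(), Width);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%f\n", P[0]);  // each element is 23 products of 5*5
    return 0;
}
```

The two __syncthreads() calls matter: the first keeps a thread from reading a tile that other threads are still filling, the second keeps a fast thread from overwriting a tile that slower threads are still reading.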

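I figured out what was wrong. Let's analyze it:

Point 1: The quest to get rid of the ever-present "zero value"

As pointed out, you must replace printf("%d \n", P[i]); with printf("%f \n", P[i]);. P[i] is a float, and printing it with %d is undefined behavior, which here showed up as 0.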

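Point 2: Why does the program fail for a Width of 512?

In fact, it fails even for a value as small as 23. Why? The code launches a single block of Width * Width threads, and 23 * 23 = 529 is > 512 (the maximum number of threads per block, as of today!).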
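One way around the per-block limit is to spread the threads over a grid of blocks instead of packing them all into one. The following is a minimal sketch under stated assumptions, not a definitive fix: the names MatrixMulKernelGrid, ceilDiv, check, and TILE are made up for illustration, and a 16x16 block (256 threads) is assumed to fit the device. The actual limit is device dependent and can be read from the maxThreadsPerBlock field of cudaDeviceProp. Checking cudaGetLastError() after the launch also makes an oversized block show up as an error rather than as silent garbage.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical block edge: 16*16 = 256 threads per block

// Integer ceiling division, used to size the grid.
static int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// One thread per output element; row and col are global indices across blocks.
__global__ void MatrixMulKernelGrid(const float *Md, const float *Nd, float *Pd, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= Width || col >= Width) return;  // threads past the matrix edge do nothing

    float Pvalue = 0.0f;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    Pd[row * Width + col] = Pvalue;
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    const int Width = 23;  // the size that fails with a single 512-thread block
    const size_t size = (size_t)Width * Width * sizeof(float);
    std::vector<float> M(Width * Width, 5.0f), N(Width * Width, 5.0f), P(Width * Width, 0.0f);

    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties");
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    float *Md, *Nd, *Pd;
    check(cudaMalloc((void**)&Md, size), "cudaMalloc Md");
    check(cudaMalloc((void**)&Nd, size), "cudaMalloc Nd");
    check(cudaMalloc((void**)&Pd, size), "cudaMalloc Pd");
    check(cudaMemcpy(Md, M.data(), size, cudaMemcpyHostToDevice), "cudaMemcpy M");
    check(cudaMemcpy(Nd, N.data(), size, cudaMemcpyHostToDevice), "cudaMemcpy N");

    // Enough TILE x TILE blocks to cover the whole Width x Width matrix.
    dim3 dimBlock(TILE, TILE);
    dim3 dimGrid(ceilDiv(Width, TILE), ceilDiv(Width, TILE));
    MatrixMulKernelGrid<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    check(cudaGetLastError(), "kernel launch");  // reports an invalid launch configuration
    check(cudaMemcpy(P.data(), Pd, size, cudaMemcpyDeviceToHost), "cudaMemcpy P");

    printf("P[0] = %f\n", P[0]);  // each element is 23 products of 5*5

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
    return 0;
}
```

With Width = 23 the grid is 2x2 blocks, so some threads in the last row and column of blocks fall outside the matrix; the bounds check at the top of the kernel is what keeps them from writing out of range.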


2011-02-17 18:54:46

In your MatrixMulKernel, your for loop looks like this:

for(int k = 0; k < Width ; ++k) 
{ 
    //rest of code  
}

Instead of Width, you must use Width*Width, since your array is of size Width*Width.


2011-02-16 21:04:08 Algorithmist

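The whole point of using CUDA parallelism is to eliminate computational overhead. In this case, each thread is responsible for only one result of the product matrix. One result (element) of the product matrix can be found using "Width" iterations. So Width*Width wouldn't work in any case.

Just like @Gaurav said, Width*Width will only blow up the memory.. – ardiyu07

You were doing fine up until this point: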
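To make the point in these comments concrete: each element of P is the dot product of one row of M and one column of N, so the innermost loop needs exactly Width iterations. Below is a host-only sketch (the function name matMulHost is hypothetical, not from the question) that spells this out and can double as a reference for checking the GPU result.

```cuda
#include <cstdio>
#include <vector>

// Host-side reference (hypothetical helper, not part of the question's code).
// P[row][col] is the dot product of row `row` of M and column `col` of N.
void matMulHost(const float *M, const float *N, float *P, int Width) {
    for (int row = 0; row < Width; ++row) {
        for (int col = 0; col < Width; ++col) {
            float sum = 0.0f;
            // Width iterations; k < Width*Width would index past both arrays.
            for (int k = 0; k < Width; ++k)
                sum += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = sum;
        }
    }
}

int main(void) {
    const int Width = 5;
    std::vector<float> M(Width * Width, 5.0f), N(Width * Width, 5.0f), P(Width * Width, 0.0f);
    matMulHost(M.data(), N.data(), P.data(), Width);
    printf("%f\n", P[0]);  // 5 products of 5*5, i.e. 125.000000
    return 0;
}
```

In the kernel, the two outer loops are what the thread indices replace; only the k loop remains inside each thread.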

for(int i = 0; i < (Width*Width) ; i++) { 
    printf("%d \n", P[i]); 
}

I changed it to %f (since it's a float), and they all print just fine :)

$ ./test.exe 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000 
125.000000


2011-02-17 14:54:35 ardiyu07

Indeed! Although I hadn't read your answer, I was just about to post the same thing.
