它已經兩天了,我仍然無法弄清楚爲什麼我的CUDA矩陣乘法的實現不同於MATLAB中產生的結果。Cuda矩陣乘法結果不同於MATLAB
CUDA內核:A(200x60000)= W(200x784)*數據(784x6000)
__global__ void CalculateA(Matrix W, Matrix Data, Matrix A)
{
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
if ((Row < W.row) && (Col < Data.col)){
float Cvalue = 0.0;
for (int i = 0; i < W.col; ++i){
Cvalue += W.elements[Row*W.col+i] * Data.elements[i*Data.col+Col];
}
A.elements[Row*A.col+Col] = Cvalue;
}
}
並調用內核:
void myFunc(Matrix W1, Matrix data){
Matrix d_W1, d_data, d_a2, a2;
size_t size;
a2.row = W1.row; d_a2.row = a2.row;
a2.col = data.col; d_a2.col = a2.col;
size = a2.col*a2.row*sizeof(float);
cudaMalloc(&d_a2.elements,size);
d_W1.row = W1.row; d_W1.col = W1.col;
size = W1.col*W1.row*sizeof(float);
cudaMalloc(&d_W1.elements,size);
cudaMemcpy(d_W1.elements,W1.elements,size,cudaMemcpyHostToDevice);
d_data.col = data.col; d_data.row = data.row;
size = data.row*data.col*sizeof(float);
cudaMalloc(&d_data.elements,size);
cudaMemcpy(d_data.elements,data.elements,size,cudaMemcpyHostToDevice);
dim3 dimGrid(data.col/32 + 1, W1.row/32 + 1, 1);
dim3 dimBlock(32, 32, 1);
CalculateA<<<dimGrid, dimBlock>>>(d_W1, d_data, d_a2);
a2.elements = new float [a2.row*a2.col];
cudaMemcpy(a2.elements,d_a2.elements,sizeof(float)*a2.row*a2.col,cudaMemcpyDeviceToHost);
printf("\nA2 first and last member %f - %f\n",a2.elements[0],a2.elements[a2.row*a2.col-1]);
}
結果差不低例如第一和最後CUDA代碼的元素爲0.011322和-0.179534,但在MATLAB中乘以0.4280和0.0056。
這是我要做的事在MATLAB:
>> size(W1) ans = 200 784
>> size(data) ans = 784 60000
>> z2=W1*data;
>> size(z2) ans = 200 60000
>> z2 = z2(:);
>> z2(1) ans = 0.4280
>> z2(200*60000)ans = 0.0056
你真的有問題嗎? (你知道Matlab默認以雙精度執行所有浮點運算?= – talonmies
那麼我的問題是如何使用CUDA重現MATLAB結果,是的,我將它們轉換爲單精度後保存了我的mat文件。 – HadiRj
試着從這很簡單,就像兩個標量和兩個2x2矩陣的矩陣乘法一樣,如果你已經卡在小矩陣上,觀察每個中間結果,希望結合輸入矩陣挖掘出來錯誤 –