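CUDA matrix multiplication - does not work for some non-square matrices
I am experimenting with CUDA programming. As part of this, I am trying to develop a matrix multiplication algorithm that runs on the GPU. The algorithm works for square matrices but fails for non-square ones. Here is my kernel: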
float* multiply_gpu(float* matrix1 , float* matrix2);
__global__ void mult(int rowsA , int columnsA, int rowsB,int columnsB, float *a,
                     float *b, float *result) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int result_size = rowsA*columnsB;
    int value = 0;//the final result
    //indices of values from input matrices
    if (index < result_size) {
        int index1 = (index/rowsA)*rowsA; //get nearest row
        int index2 = index%columnsB; //get start column
        int k = 0;
        while (k<columnsA) { //columnsA == rowsB
            value += a[index1]*b[index2]; //v = sum a_ik * b_kj
            index1 ++;
            index2 += columnsB;
            k++;
        }
        result[index] = value;
    }
}
After a brief overall check, my tutor has not spotted any problem with it. I believe the problem lies in this function:
float* multiply_gpu(float* matrix1 , float* matrix2) {
    //the dimensions of the matrices
    size_t available, total;
    cudaError_t error;
    cudaError err = cudaMemGetInfo(&available, &total);
    if(err != cudaSuccess){
        printf("There was an error: %s\n", cudaGetErrorString(err));
    }
    int height1 = matrix1[0];
    int width1 = matrix1[1];
    int height2 = matrix2[0];
    int width2 = matrix2[1];
    if (width1!=height2) {
        return NULL;
    }
    //this array contains the result of the operation
    float* result = (float *) malloc(height1*width2*sizeof(float));
    //pointers for device matrices
    float *d_matrix1;
    float *d_matrix2;
    float *d_result;
    //allocate memory for matrices
    error = cudaMalloc((void **)&d_matrix1,(size_t)height1*width1*sizeof(float));
    if (error != cudaSuccess) {
        fprintf(stderr, "Failed to allocate memory (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
    error = cudaMalloc((void **)&d_matrix2,height2*width2*sizeof(float));
    if (error != cudaSuccess) {
        fprintf(stderr, "Failed to allocate memory (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
    error = cudaMalloc((void **)&d_result,height1*width2*sizeof(float));
    if (error != cudaSuccess) {
        fprintf(stderr, "Failed to allocate memory (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
    //now copy matrices onto device -- note the offset of 2
    error = cudaMemcpy(d_matrix1 , matrix1+2 , height1*width1*sizeof(float), cudaMemcpyHostToDevice);
    if (error != cudaSuccess) {
        fprintf(stderr, "Failed to copy memory (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
    error = cudaMemcpy(d_matrix2 , matrix2+2 , height2*width2*sizeof(float), cudaMemcpyHostToDevice);
    if (error != cudaSuccess) {
        fprintf(stderr, "Failed to copy memory (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
    //launch multiplication kernel
    //note I have tried adjusting the kernel values between <<< , >>> to no avail
    mult<<<height1,width2>>>(height1,width1,height2,width2,d_matrix1,d_matrix2,d_result);
    printf("%d %d %d %d\n",height1,width1,height2,width2);
    //make the host block until mult is finished running
    //printf("finished multiplying\n");
    cudaDeviceSynchronize();
    //copy result back
    error = cudaMemcpy(result,d_result,height1*width2*sizeof(float),cudaMemcpyDeviceToHost);
    if (error != cudaSuccess) {
        fprintf(stderr, "Failed to copy memory (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
    //free now unneeded cuda memory
    cudaFree(d_matrix1);
    cudaFree(d_matrix2);
    cudaFree(d_result);
    printf("GOT RESULT\n");
    for (int i=0;i<height1*width2;i++) {
        printf("%f ",result[i]);
    }
    printf("\n");
    //result ready to be returned
    return result;
}
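Note that the matrices passed as arguments to multiply_gpu store their height at index 0 and their width at index 1. The result matrix does not carry this information.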
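To make that packed layout concrete, here is a minimal host-side sketch in plain C++. The names `PackedMatrixView`, `view_packed`, and `element_count` are hypothetical helpers introduced only for this illustration; they are not part of the code above. It assumes the dimensions are small enough to be stored exactly in a `float` (integers up to 2^24 are exact).

```cpp
#include <cstddef>

// Hypothetical read-only view over the packed layout described above:
// packed[0] = height, packed[1] = width, packed[2...] = row-major elements.
struct PackedMatrixView {
    int height;
    int width;
    const float* data;  // points at the first element, i.e. packed + 2
};

// Splits a packed array into its dimensions and a pointer to the payload.
// Assumption: the caller guarantees the array holds 2 + height*width floats.
inline PackedMatrixView view_packed(const float* packed) {
    return PackedMatrixView{static_cast<int>(packed[0]),
                            static_cast<int>(packed[1]),
                            packed + 2};
}

// Number of payload elements, computed in size_t to avoid int overflow.
inline std::size_t element_count(const PackedMatrixView& m) {
    return static_cast<std::size_t>(m.height) * static_cast<std::size_t>(m.width);
}
```

For the array {2,3,1,2,3,4,5,6}, `view_packed` yields height 2, width 3, and a payload starting at the value 1, which corresponds to the `matrix1+2` offset used in the `cudaMemcpy` calls.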
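An example of an incorrect computation: when I feed the following arrays to multiply_gpu - {2,3,1,2,3,4,5,6}, {3,2,1,2,3,4,5,6} - the answer should be {22,28,49,64}, but my unit test produces {22,28,40,52}. So close! Note that for the dot product (1,2,3)*(1,2,3) (which is not square) the algorithm is perfectly happy... What could be the error here? Thanks for any help. If I find a solution on my own, I will post it.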
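Tracing the kernel's index arithmetic on the host for this input reproduces the wrong output, which points at the row-start computation `(index/rowsA)*rowsA`. The sketch below is plain C++ rather than CUDA: it runs the same per-thread loop once per flat output index. The function names and the `use_fix` switch are hypothetical, added only for this illustration. With `use_fix` set, the row start is computed as `(index/columnsB)*columnsA`, which is one possible correction, not necessarily the one the author settled on. The sketch also accumulates into a `float`, whereas the kernel accumulates into an `int` and would truncate fractional products.

```cpp
#include <cstddef>
#include <vector>

// Computes one output element for flat row-major position `index`,
// mirroring the loop in the `mult` kernel. No bounds checking is done,
// so with use_fix == false other shapes may read outside `a`.
float element_for_index(int index, int rowsA, int columnsA, int columnsB,
                        const float* a, const float* b, bool use_fix) {
    int index1 = use_fix
        ? (index / columnsB) * columnsA  // start of the output row within A
        : (index / rowsA) * rowsA;       // expression used in the question
    int index2 = index % columnsB;       // start of the output column within B
    float value = 0.0f;
    for (int k = 0; k < columnsA; ++k) { // columnsA == rowsB
        value += a[index1] * b[index2];
        ++index1;                        // next element in the row of A
        index2 += columnsB;              // next element in the column of B
    }
    return value;
}

// Runs the per-index computation for every element of the result.
std::vector<float> multiply_host(int rowsA, int columnsA, int columnsB,
                                 const std::vector<float>& a,
                                 const std::vector<float>& b, bool use_fix) {
    std::vector<float> result(static_cast<std::size_t>(rowsA) * columnsB);
    for (int index = 0; index < rowsA * columnsB; ++index) {
        result[index] = element_for_index(index, rowsA, columnsA, columnsB,
                                          a.data(), b.data(), use_fix);
    }
    return result;
}
```

Calling `multiply_host(2, 3, 2, a, b, false)` with both payloads equal to {1,2,3,4,5,6} gives {22,28,40,52}, while passing `true` gives {22,28,49,64}. The two expressions coincide whenever rowsA, columnsA, and columnsB are all equal, and both evaluate to 0 for the 1x3 by 3x1 dot product, which would explain why square inputs and the dot product pass.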
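There are quite a few questions about matrix multiplication under the CUDA tag. Have you looked at them? What happens if you run your code with 'cuda-memcheck'? SO expects: "Questions concerning problems with code you've written must describe the specific problem - and include valid code to reproduce it - in the question itself. See SSCCE.org for guidance." Voting to close. You have not provided SSCCE.org code. –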
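Yes, matrix multiplication on the GPU is very common, and there are many SO questions about it. I have read them, but perhaps not thoroughly enough. I was simply at my wits' end and came here for a sanity check. Thanks for the link to SSCCE.org - I am reviewing it now. I am also learning about cuda-memcheck. Overall, this bug is consuming me. I think I need to pay closer attention to my own code and to the comments on other matrix multipliers. – YardGlassOfCode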
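I updated my answer because I still did not have it quite right. I think it is correct now - it works for the case you mentioned as well as three other cases I tried. –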