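DTRMM & DTRSM hang at certain matrix sizes

I am testing the performance of ?GEMM, ?TRMM, and ?TRSM using MKL's automatic offload on the new Intel Xeon Phi coprocessors, and I am running into problems with DTRMM and DTRSM. My code tests matrix sizes from 1024 to 10240 in steps of 1024, and performance seems to drop significantly somewhere after N = M = K = 8192. When I tried to pin down exactly where by using a step size of 2, my script hung. I then tried a step size of 512, which works fine, and 256 works as well, but anything under 256 just stalls. I cannot find any known issues relating to this problem. All the single-precision versions work, as do both single and double precision for GEMM. Here is my code: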
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include "mkl.h"

#define DBG 0

int main(int argc, char **argv)
{
    char transa = 'N', side = 'L', uplo = 'L', diag = 'U';
    MKL_INT N, NP;           // N = M, N, K, lda, ldb, ldc
    double alpha = 1.0;      // Scaling factor
    double *A, *B;           // Matrices
    size_t matrix_bytes;     // Matrix size in bytes
    size_t matrix_elements;  // Matrix size in elements
    size_t k;                // Counter over all elements
    MKL_INT i, j;            // Row and column counters
    clock_t start, diff;

    if (argc < 2)
    {
        printf("Usage: %s N\n", argv[0]);
        return -1;
    }
    N = atoi(argv[1]);
    if (N <= 0)
    {
        printf("N must be a positive integer\n");
        return -1;
    }

    // The timer starts here, so it also covers allocation and initialization.
    // clock() reports processor time, not wall-clock time.
    start = clock();

    // size_t keeps the byte count from overflowing a 32-bit int for large N
    matrix_elements = (size_t)N * (size_t)N;
    matrix_bytes = sizeof(double) * matrix_elements;

    // Allocate the matrices
    A = malloc(matrix_bytes);
    if (A == NULL)
    {
        printf("Could not allocate matrix A\n");
        return -1;
    }
    B = malloc(matrix_bytes);
    if (B == NULL)
    {
        printf("Could not allocate matrix B\n");
        free(A);
        return -1;
    }
    for (k = 0; k < matrix_elements; k++)
    {
        A[k] = 0.0;
        B[k] = 0.0;
    }

    // Initialize the lower triangle (column-major storage)
    for (i = 0; i < N; i++)
        for (j = 0; j <= i; j++)
        {
            A[i + (size_t)N * j] = 1.0;
            B[i + (size_t)N * j] = 2.0;
        }

    // DTRMM call
    dtrmm(&side, &uplo, &transa, &diag, &N, &N, &alpha, A, &N, B, &N);

    diff = clock() - start;
    printf("%f\n", (double)diff / CLOCKS_PER_SEC);  // Elapsed time in seconds

    if (DBG == 1)
    {
        printf("\nMatrix dimension is set to %d \n\n", (int)N);

        // Display the result
        printf("\nResulting matrix B:\n");
        if (N > 10)
        {
            printf("NOTE: B is too large, print only upper-left 10x10 block...\n");
            NP = 10;
        }
        else
            NP = N;
        printf("\n");
        for (i = 0; i < NP; i++)
        {
            for (j = 0; j < NP; j++)
                printf("%7.3f ", B[i + (size_t)j * N]);
            printf("\n");
        }
    }

    // Free the matrix memory
    free(A);
    free(B);
    return 0;
}
Any help or insight would be greatly appreciated.
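A side note on the timing in the code above: clock() reports processor time rather than elapsed time, and MKL is typically multithreaded, so the printed number may not reflect what a stopwatch would show. A minimal sketch of a wall-clock timer follows, assuming a POSIX system where clock_gettime and CLOCK_MONOTONIC are available. The helper name wall_seconds is hypothetical and not part of MKL or of the original code.

```c
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict -std=c99 */
#include <time.h>

/* Hypothetical helper (not from MKL or the original post):
 * returns seconds read from a monotonic wall clock.
 * Assumes POSIX clock_gettime is available on the host. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

To time only the BLAS call, one could read wall_seconds() immediately before and immediately after the dtrmm call and print the difference, which also keeps allocation and initialization out of the measurement.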
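I'll start searching through your answers. Thanks for the response! – mjswartz 2013-02-20 16:46:36
Actually, if you could point me to the topic of the question, that would be much appreciated. 26 pages of answers is a lot to browse! – mjswartz 2013-02-20 16:57:45
Here is a discussion of this for naive matrix multiplication: (http://stackoverflow.com/questions/7905760/matrix-multiplication-small-difference-in-matrix-size-large-difference-in-timi). The mechanism in your case is a little different, since MKL does cache blocking, but you are running into essentially the same phenomenon. I'll add more details later today. – 2013-02-20 17:09:08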
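If the slowdown near N = 8192 is indeed the power-of-two stride effect described in the linked question, one common workaround is to pad the leading dimension so that consecutive columns do not start at addresses a large power of two apart. The sketch below is only an illustration under assumptions: the helper name padded_ld, the 4096-byte threshold, and the 64-byte pad are all hypothetical choices, not values taken from MKL or from this thread, and whether padding helps with automatic offload on the Xeon Phi (or has any bearing on the hang) has not been verified here.

```c
#include <stddef.h>

/* Hypothetical helper (not part of MKL): choose a leading dimension,
 * in elements, for a column-major matrix with n rows.
 * Assumption: a column stride that is a multiple of 4096 bytes is
 * treated as "critical", and one 64-byte cache line of padding is
 * added in that case. Both constants are illustrative guesses. */
size_t padded_ld(size_t n, size_t elem_size)
{
    const size_t critical_bytes = 4096;  /* assumed problematic stride */
    const size_t pad_bytes = 64;         /* assumed cache-line size */

    if (n > 0 && (n * elem_size) % critical_bytes == 0)
        return n + pad_bytes / elem_size;  /* break the power-of-two stride */
    return n;                              /* stride is already "odd" enough */
}
```

With this, a matrix would be allocated with lda * N elements instead of N * N, indexed as A[i + lda * j], and lda would be passed to dtrmm as the leading-dimension argument in place of N. For doubles, N = 8192 would give lda = 8200 under these assumptions.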