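Performance of OpenMP code
I wrote code for matrix-vector multiplication. The matrix is divided into blocks of rows according to the number of threads; each block is multiplied by the vector, and the resulting vector is stored in an array private to the thread. But my speedup is very poor: for a matrix of size 16×16, it is below 1.
Can this be attributed to the fact that I declared the matrix and vector outside the parallel region as shared variables, so that each thread trying to read their values might cause a race condition or false sharing?
I am a bit confused about the difference between false sharing and race conditions.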
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

#define SIZE 128 // The size should be divisible by the number of threads

int main(int argc, char *argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);

    // Declare the variables
    int i, j;
    long A[SIZE][SIZE], b[SIZE], V[SIZE] = {0};
    //long Vect[SIZE]={0};
    double start, end;

    // Generate a matrix of size mxm
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++)
            A[i][j] = i + j;
    }

    printf("The Matrix is:\n");
    // Print the Matrix
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            printf("%12ld", A[i][j]);
        }
        printf("\n");
    }

    // Generate a vector of size m
    for (i = 0; i < SIZE; i++)
        b[i] = i;

    printf("The vector is: \n");
    // Print a vector
    for (i = 0; i < SIZE; i++)
        printf("%12ld\n", b[i]);

    start = omp_get_wtime();
    //omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel num_threads(thread_count)
    {
        int i, j, k, id, nthrds;
        long Vect[SIZE] = {0}; // Thread-private result vector

        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();

        // Each thread multiplies its own block of rows by the vector
        for (i = id*SIZE/nthrds; i < (id*SIZE/nthrds + SIZE/nthrds); i++) {
            Vect[i] = 0;
            for (j = 0; j < SIZE; j++)
                Vect[i] += A[i][j]*b[j];
        }

        // Add the private results into the shared vector, one thread at a time
        #pragma omp critical
        {
            for (k = 0; k < SIZE; k++)
                V[k] += Vect[k];
        }
    }
    end = omp_get_wtime();

    printf("The vector obtained after multiplication is:\n");
    for (i = 0; i < SIZE; i++)
        printf("%12ld\n", V[i]);
    printf("The time taken for calculation is: %lf\n", end - start);
    return 0;
}
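This is most likely a case where the workload is so small (each thread does only 256/num_thread multiply-adds) that the overhead of setting up the multithreaded parallelization outweighs the speedup gained from it. And yes, sharing write state between threads is likely to make the parallelization overhead even higher. – aruisdante 2015-02-24 18:04:32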
For more on false sharing: http://stackoverflow.com/questions/9027653/openmp-false-sharing?rq=1. For some interesting discussion of OpenMP performance in general: http://stackoverflow.com/questions/10939158/openmp-performance?rq=1 – aruisdante 2015-02-24 18:10:44
@aruisdante There are no shared writes, only shared reads – 2015-02-24 21:39:31