Low performance in an OpenMP program

I am trying to understand the OpenMP code from here. You can see the code below.

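1. To measure the speedup between the serial and the OpenMP version, I time both with gettimeofday() from sys/time.h. Is this the right approach?
2. The program runs on a 4-core machine. I set export OMP_NUM_THREADS="4" but do not see a substantial speedup; I usually get 1.2 - 1.7. What problems am I facing in this parallelization?
3. Which debugging/performance tool could I use to see where the performance is lost?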

Code (I compile it with xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe):

#include <omp.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <sys/time.h> 
#define CHUNKSIZE 1000000 
#define N  100000000 

int main (int argc, char *argv[]) 
{ 
    int nthreads, tid, i, chunk; 
    /* static storage: three 400 MB arrays would overflow a typical stack */
    static float a[N], b[N], c[N]; 
    unsigned long elapsed; 
    unsigned long elapsed_serial; 
    unsigned long elapsed_omp; 
    struct timeval start; 
    struct timeval stop; 


    chunk = CHUNKSIZE; 

    // ================= SERIAL  start ======================= 
    /* Some initializations */ 
    for (i=0; i < N; i++) 
     a[i] = b[i] = i * 1.0; 
    gettimeofday(&start,NULL); 
    for (i=0; i<N; i++) 
    { 
     c[i] = a[i] + b[i]; 
     //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]); 
    } 
    gettimeofday(&stop,NULL); 
    elapsed = 1000000 * (stop.tv_sec - start.tv_sec); 
    elapsed += stop.tv_usec - start.tv_usec; 
    elapsed_serial = elapsed ; 
    printf (" \n Time SEQ= %lu microsecs\n", elapsed_serial); 
    // ================= SERIAL  end ======================= 


    // ================= OMP start ======================= 
    /* Some initializations */ 
    for (i=0; i < N; i++) 
     a[i] = b[i] = i * 1.0; 
    gettimeofday(&start,NULL); 
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid) 
    { 
     tid = omp_get_thread_num(); 
     if (tid == 0) 
     { 
      nthreads = omp_get_num_threads(); 
      printf("Number of threads = %d\n", nthreads); 
     } 
     //printf("Thread %d starting...\n",tid); 

#pragma omp for schedule(static,chunk) 
     for (i=0; i<N; i++) 
     { 
      c[i] = a[i] + b[i]; 
      //printf("Thread %d: c[%d]= %f\n",tid,i,c[i]); 
     } 

    } /* end of parallel section */ 
    gettimeofday(&stop,NULL); 
    elapsed = 1000000 * (stop.tv_sec - start.tv_sec); 
    elapsed += stop.tv_usec - start.tv_usec; 
    elapsed_omp = elapsed ; 
    printf (" \n Time OMP= %lu microsecs\n", elapsed_omp); 
    // ================= OMP end ======================= 
    printf (" \n speedup= %f \n\n", ((float) elapsed_serial)/((float) elapsed_omp)) ; 

    return 0;
}

2010-12-22 flow

You might also want to specify which operating system and which compiler you are using, to help others answer #1 and #3. – 2010-12-22 20:21:28

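There is nothing wrong with the code above, but your speedup is going to be limited by the fact that the main loop, c = a + b, does very little work. The time needed for the computation (a single addition) is dominated by the memory access time (two loads and one store), and contention for memory bandwidth grows as more threads operate on the arrays.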
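A rough back-of-the-envelope version of this argument, as a sketch only: each iteration moves 12 bytes (two float loads, one float store) for a single addition, so the loop cannot run faster than memory can deliver that traffic. The helper names below are hypothetical, and the bandwidth is whatever figure you assume for your machine, not a measurement from the system in the question.

```c
#include <stddef.h>

/* Bytes moved per iteration of c[i] = a[i] + b[i]:
 * two float loads (a[i], b[i]) and one float store (c[i]). */
size_t bytes_per_element(void)
{
    return 3 * sizeof(float);
}

/* Lower bound, in seconds, on the loop time if memory bandwidth
 * (in bytes per second) is the only limit. The bandwidth is an
 * assumed input, not something this code measures. */
double bandwidth_bound_seconds(size_t n, double bandwidth_bytes_per_s)
{
    return (double)n * (double)bytes_per_element() / bandwidth_bytes_per_s;
}
```

For example, with N = 100000000 and an assumed 12 GB/s of sustained bandwidth, the bound is about 0.1 s. If all threads share that same bandwidth, adding threads does not lower it.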

We can test this by making the work inside the loop more compute-intensive:

c[i] = exp(sin(a[i])) + exp(cos(b[i]));

and then we get

$ ./apb 

Time SEQ= 17678571 microsecs 
Number of threads = 4 

Time OMP= 4703485 microsecs 

speedup= 3.758611

which is obviously a lot closer to the 4x speedup one would expect.
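To experiment with how the amount of work per element affects scaling, something like the following sketch may help. It is not the code that produced the numbers above: heavy() is a hypothetical stand-in for exp(sin(x)) + exp(cos(y)) that uses a tunable number of dependent multiply-adds, so it needs no math library, and add_heavy() and the work parameter are likewise names made up for illustration. With work = 0 it reduces to the original c[i] = a[i] + b[i].

```c
#include <stddef.h>

/* Hypothetical compute kernel: `work` dependent multiply-adds per
 * element. work == 0 gives plain x + y, as in the original loop. */
static float heavy(float x, float y, int work)
{
    float acc = x + y;
    int k;
    for (k = 0; k < work; k++)
        acc = acc * 0.5f + 1.0f;
    return acc;
}

/* Parallel loop over n elements. Without OpenMP enabled at compile
 * time the pragma is ignored and the loop simply runs serially. */
void add_heavy(const float *a, const float *b, float *c, long n, int work)
{
    long i;
#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        c[i] = heavy(a[i], b[i], work);
}
```

Timing add_heavy() for increasing values of work, with 1 and then 4 threads, should show the speedup moving from the memory-bound regime toward the thread count as computation starts to dominate.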

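Update: Oh, and as for the other questions: gettimeofday() is probably fine for timing. As for tools, you are using xlc, so is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here) and scalasca for general OpenMP issues; OpenSpeedShop is pretty useful, too.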
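If you would rather not do the microsecond arithmetic by hand, one alternative is OpenMP's own wall-clock timer, omp_get_wtime(), which returns seconds as a double. The sketch below is only an illustration: wall_seconds() and speedup() are hypothetical helper names, and the gettimeofday() fallback is there so that the code still builds when OpenMP is not enabled.

```c
#include <stddef.h>
#include <sys/time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds: omp_get_wtime() when compiled with
 * OpenMP enabled, gettimeofday() otherwise. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
#endif
}

/* Speedup = serial time / parallel time.
 * Returns 0 if the parallel time is not positive. */
double speedup(double serial_s, double parallel_s)
{
    return parallel_s > 0.0 ? serial_s / parallel_s : 0.0;
}
```

Calling wall_seconds() before and after each loop and passing the two differences to speedup() gives the same ratio the program in the question prints; the times reported above (17678571 and 4703485 microseconds) yield roughly 3.76.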

2010-12-22 21:35:37
