OpenMP: poor performance with heap arrays (stack arrays work fine)

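I am a fairly experienced OpenMP user, but I have run into a puzzling problem, and I am hoping that someone here can help. The problem is that a simple hashing algorithm performs well for stack-allocated arrays, but poorly for arrays on the heap.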

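The example below uses i%M (i modulo M) to count every M-th integer in the respective array element. For simplicity, imagine N = 1000000, M = 10. If N%M == 0, then the result should be that every element of bins[] is equal to N/M: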

#pragma omp for 
    for (int i=0; i<N; i++) 
    bins[ i%M ]++;

The array bins[] is private to each thread (I sum up the results of all threads in a critical section afterwards).

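When bins[] is allocated on the stack, the program works great, with performance scaling in proportion to the number of cores. However, if bins[] is on the heap (with the pointer to bins[] on the stack), performance drops drastically. And that is a major problem!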

I want to use OpenMP to parallelize binning (hashing) of certain data into heap arrays, and this is a major performance hit.

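It is definitely not something silly like all threads trying to write into the same area of memory: each thread has its own bins[] array, the results are correct with both heap- and stack-allocated bins, and there is no difference in performance for single-threaded runs. I reproduced the problem on different hardware (Intel Xeon and AMD Opteron), with both GCC and the Intel C++ compiler. All tests were run on Linux (Ubuntu and RedHat).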

There seems to be no reason why good OpenMP performance should be limited to stack arrays.

Any guesses? Maybe threads access the heap through some kind of shared gateway on Linux? How do I fix this?

The complete program to play around with is below:

#include <stdlib.h> 
#include <stdio.h> 
#include <omp.h> 

int main(const int argc, const char* argv[])
{
    const int N=1024*1024*1024;
    const int M=4;
    double t1, t2;
    int checksum=0;

    printf("OpenMP threads: %d\n", omp_get_max_threads());

    //////////////////////////////////////////////////////////////////
    // Case 1: stack-allocated array
    t1=omp_get_wtime();
    checksum=0;
#pragma omp parallel
    {   // Each openmp thread should have a private copy of
        // bins_thread_stack on the stack:
        int bins_thread_stack[M];
        for (int j=0; j<M; j++) bins_thread_stack[j]=0;
#pragma omp for
        for (int i=0; i<N; i++)
        {   // Accumulating every M-th number in respective array element
            const int j=i%M;
            bins_thread_stack[j]++;
        }
#pragma omp critical
        for (int j=0; j<M; j++) checksum+=bins_thread_stack[j];
    }
    t2=omp_get_wtime();
    printf("Time with stack array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
    //////////////////////////////////////////////////////////////////

    //////////////////////////////////////////////////////////////////
    // Case 2: heap-allocated array
    t1=omp_get_wtime();
    checksum=0;
#pragma omp parallel
    {   // Each openmp thread should have a private copy of
        // bins_thread_heap on the heap:
        int* bins_thread_heap=(int*)malloc(sizeof(int)*M);
        for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
        for (int i=0; i<N; i++)
        {   // Accumulating every M-th number in respective array element
            const int j=i%M;
            bins_thread_heap[j]++;
        }
#pragma omp critical
        for (int j=0; j<M; j++) checksum+=bins_thread_heap[j];
        free(bins_thread_heap);
    }
    t2=omp_get_wtime();
    printf("Time with heap array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
    //////////////////////////////////////////////////////////////////

    return 0;
}

Sample output of the program is as follows:

For OMP_NUM_THREADS=1:

OpenMP threads: 1 
Time with stack array: 2.973 sec, checksum=1073741824 (must be 1073741824). 
Time with heap array: 3.091 sec, checksum=1073741824 (must be 1073741824).

And for OMP_NUM_THREADS=10:

OpenMP threads: 10 
Time with stack array: 0.329 sec, checksum=1073741824 (must be 1073741824). 
Time with heap array: 2.150 sec, checksum=1073741824 (must be 1073741824).

I would greatly appreciate any help!

Asked 2011-07-07 by drlemon

This is a cute problem: with the code as above (gcc 4.4, Intel Core i7) and 4 threads, I get

OpenMP threads: 4 
Time with stack array:  1.696 sec, checksum=1073741824 (must be 1073741824). 
Time with heap array:  5.413 sec, checksum=1073741824 (must be 1073741824).

but if I change the malloc line to

int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);

(Update: or even

int* bins_thread_heap=(int*)malloc(sizeof(int)*16);

）

then I get

OpenMP threads: 4 
Time with stack array:  1.578 sec, checksum=1073741824 (must be 1073741824). 
Time with heap array:  1.574 sec, checksum=1073741824 (must be 1073741824).

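The problem here is false sharing. The default malloc is very (space-)efficient and places the small requested allocations all in one block of memory, right next to each other. But because the allocations are so small that several of them fit in the same cache line, every time one thread updates its values it dirties the cache line that also holds the values of neighbouring threads. Making the requested memory large enough means this is no longer an issue.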
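
If padding by a factor of 1024 feels wasteful, one alternative is to request memory that starts on a cache-line boundary and spans whole cache lines. The following is only a sketch under assumptions that are not in the original post: a 64-byte cache line (typical for x86-64, but not guaranteed), a C11 library providing `aligned_alloc`, and the helper names `round_up_to_line` and `alloc_line_padded_bins`, which are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 64-byte cache lines (common on x86-64, not guaranteed). */
#define CACHE_LINE 64

/* Round a byte count up to the next multiple of CACHE_LINE. */
size_t round_up_to_line(size_t bytes)
{
    return (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Hypothetical helper: allocate m zero-initialized int bins that begin on a
 * cache-line boundary and occupy a whole number of cache lines, so that no
 * other allocation's payload shares a line with them.
 * Returns NULL on failure; release the result with free(). */
int *alloc_line_padded_bins(size_t m)
{
    assert(m > 0);
    size_t bytes = round_up_to_line(m * sizeof(int));
    int *bins = aligned_alloc(CACHE_LINE, bytes); /* C11 */
    if (bins != NULL)
        memset(bins, 0, bytes);
    return bins;
}
```

In the program from the question, the malloc line of Case 2 would then become `int* bins_thread_heap=alloc_line_padded_bins(M);`. For M=4 that is 64 bytes per thread instead of the 16 KB requested by `sizeof(int)*M*1024`.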

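Incidentally, it should be clear why the stack-allocated case does not show this problem: different threads have different stacks, and that memory is far enough apart that false sharing is not an issue. As a side point: it does not really matter for an M of the size you are using here, but if your M (or the number of threads) were larger, the omp critical would be a big serial bottleneck; you can use OpenMP reductions to sum the checksum more efficiently: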

#pragma omp parallel reduction(+:checksum)
    {   // Each openmp thread should have a private copy of
        // bins_thread_heap on the heap:
        int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
        for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
        for (int i=0; i<N; i++)
        {   // Accumulating every M-th number in respective array element
            const int j=i%M;
            bins_thread_heap[j]++;
        }
        for (int j=0; j<M; j++)
            checksum+=bins_thread_heap[j];
        free(bins_thread_heap);
    }

Answered 2011-07-07 13:47:10

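This is great, Jonathan, thank you! So does this mean that the only way to use the heap efficiently is to waste it? Maybe some OpenMP implementations have a special malloc function; I will have to research that. By the way, your statement that the critical block is a bottleneck is not correct. The critical block is at the end of my parallel section, not inside the for loop. In fact, the reduction clause implements reductions by doing exactly that: placing a critical block at the end of the parallel section. But thanks for the lead! – drlemon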

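Ah, but (a) critical is a very heavyweight operation, and (b) it is coarser than needed: you could compute your local sum first and then use a critical (or, better, an atomic) to update the global sum. But even then, with a large number of threads the reduction will still be faster, because the final reduction can be done hierarchically (in time ln(number of threads) rather than (number of threads)). –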
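
A minimal sketch of the "local sum first, then one atomic update" idea from the comment above. The function name `checksum_with_atomic` and its parameters are hypothetical, the `M*1024` padding is borrowed from the answer, and the code avoids OpenMP runtime calls so that it still produces the same result if compiled without OpenMP support.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts i % m for every i in [0, n) into per-thread heap bins and returns
 * the sum over all bins, which should equal n. Aborts if allocation fails. */
long checksum_with_atomic(int n, int m)
{
    long checksum = 0;
    assert(n >= 0 && m > 0);

#pragma omp parallel
    {
        /* Private, zeroed bins, padded as in the answer to avoid false sharing. */
        int *bins = calloc((size_t)m * 1024, sizeof(int));
        if (bins == NULL)
            abort();

#pragma omp for
        for (int i = 0; i < n; i++)
            bins[i % m]++;

        /* Local sum: bins is private, so no synchronization is needed here. */
        long local_sum = 0;
        for (int j = 0; j < m; j++)
            local_sum += bins[j];

        /* One atomic update per thread, instead of a critical section
         * wrapped around the whole summation loop. */
#pragma omp atomic
        checksum += local_sum;

        free(bins);
    }
    return checksum;
}
```

Calling `checksum_with_atomic(1000000, 10)` should return 1000000 regardless of the number of threads.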

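As for efficient use of the heap: avoiding false sharing is an issue common to all shared-memory operations, and the only way to avoid it is to make sure you have disjoint chunks of memory that are at least a cache line apart. The size of that spacing will be system-dependent; making it multiple KB is overkill, and typically 512 bytes or so will do it. –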
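
Since the required spacing is system-dependent, it can be queried at run time rather than hard-coded. The sketch below assumes Linux with glibc, where `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)` is available; that constant is not part of POSIX, so the code falls back to an assumed 64 bytes when it is missing or reports nothing useful. The helper names `cache_line_bytes` and `padded_stride_ints` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes, or an assumed 64 if the
 * system does not report one. */
size_t cache_line_bytes(void)
{
    long line = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE   /* glibc extension, not POSIX */
    line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return line > 0 ? (size_t)line : 64;
}

/* Number of ints each thread's slice must span so that slices of m bins,
 * laid out back to back in one cache-line-aligned block, never share a
 * cache line. */
size_t padded_stride_ints(size_t m)
{
    size_t line = cache_line_bytes();
    size_t padded = (m * sizeof(int) + line - 1) / line * line;
    assert(padded % sizeof(int) == 0);
    return padded / sizeof(int);
}
```

With a single shared block whose base address is cache-line aligned, thread number t would then use the bins starting at index `t * padded_stride_ints(M)`.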
