OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Consider the following code...

for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}

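...which I want to run on multiple CPUs/cores. The function f does not use velocity.

A naive #pragma omp parallel for before the first for loop produces unpredictable/wrong results, because std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.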

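My first idea was to put #pragma omp atomic before the velocity[j] += f(j); operation. That gives me a compile error (possibly related to the elements being of type Eigen::Vector3d, or to velocity being a class member). Besides, I have read that atomic operations are very slow compared with giving each thread a private variable and doing a reduction at the end. So that is what I would like to do, I think.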
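One likely reason for the compile error: #pragma omp atomic only accepts an update of a scalar lvalue, and an Eigen::Vector3d is a class type. The sketch below shows what a component-wise atomic update could look like. Vec3, atomic_add and accumulate_atomic are hypothetical names for illustration, Vec3 stands in for Eigen::Vector3d, and a constant delta stands in for f(j).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Eigen::Vector3d: three plain doubles.
using Vec3 = std::array<double, 3>;

// Adds `delta` to `velocity[j]` one component at a time.
// Each update targets a single double, which is a scalar lvalue.
inline void atomic_add(std::vector<Vec3>& velocity, std::size_t j, const Vec3& delta)
{
    assert(j < velocity.size());
    for (std::size_t k = 0; k < 3; ++k) {
        double* slot = &velocity[j][k];
        #pragma omp atomic
        *slot += delta[k];
    }
}

// Adds `delta` (a placeholder for f(j)) to every index listed in each cluster,
// running the outer loop over clusters in parallel.
inline void accumulate_atomic(const std::vector<std::vector<int>>& clusters,
                              const Vec3& delta,
                              std::vector<Vec3>& velocity)
{
    const long n = static_cast<long>(clusters.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        for (int j : clusters[i])
            atomic_add(velocity, static_cast<std::size_t>(j), delta);
}
```

This is correct when several clusters share an index, but every single component update pays the cost of an atomic operation, which is the overhead mentioned above.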

I came up with this:

#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}

Is this a good solution? Is it the best solution? (Is it even correct?)
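For anyone who wants to try the pattern above without Eigen, here is a minimal self-contained sketch. Vec3, f and accumulate_critical are hypothetical: Vec3 stands in for Eigen::Vector3d, and this f is an invented placeholder, since the real f(j) is not shown. The nowait clause is an addition that is not in the code above; it should be safe here because each thread touches only its own private vector before it enters the critical section.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Vec3 = std::array<double, 3>;   // hypothetical stand-in for Eigen::Vector3d

// Hypothetical per-index contribution; the real f(j) is not shown in the question.
inline Vec3 f(int j) { return Vec3{1.0 * j, 2.0 * j, 3.0 * j}; }

inline void accumulate_critical(const std::vector<std::set<int>>& clusters,
                                std::vector<Vec3>& velocity)
{
    const long n = static_cast<long>(clusters.size());
    #pragma omp parallel
    {
        // Private accumulator, zero-initialized once per thread.
        std::vector<Vec3> local(velocity.size(), Vec3{0.0, 0.0, 0.0});

        // nowait: a thread that finishes early may start merging right away.
        #pragma omp for nowait
        for (long i = 0; i < n; ++i)
            for (int j : clusters[i]) {
                assert(j >= 0 && static_cast<std::size_t>(j) < local.size());
                const Vec3 d = f(j);
                for (std::size_t k = 0; k < 3; ++k)
                    local[j][k] += d[k];
            }

        // Only one thread at a time merges its private copy into the shared vector.
        #pragma omp critical
        {
            for (std::size_t i = 0; i < velocity.size(); ++i)
                for (std::size_t k = 0; k < 3; ++k)
                    velocity[i][k] += local[i][k];
        }
    }
}
```

The merge is serialized, so its cost grows with the number of threads times velocity.size(). Whether that matters depends on how expensive the per-cluster calculations are.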

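Further thoughts: using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.

I tried to find a question about a similar problem, and this question looks almost the same. But I think my case may differ because the last step includes a for loop. Also, the question of whether this is the best approach still stands.

Edit: As requested in the comments, the reduction clause...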

#pragma omp parallel reduction(+:velocity) 
    for (omp_int i = 0; i < velocity_local.size(); ++i) 
     velocity[i] += velocity_local[i];

...throws the following error:

error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause

(similar error with g++)
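One way around this restriction, sketched here under assumptions, is to reduce into a local variable and add it to the member afterwards. The example relies on a user-defined reduction (#pragma omp declare reduction), which requires OpenMP 4.0 or later, so a compiler limited to OpenMP 2.0 will not accept it. Accumulator, vec_plus and add are hypothetical names, plain double elements replace Eigen::Vector3d to keep the sketch short, and a constant 1.0 stands in for f(j).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// User-defined reduction (OpenMP 4.0+): element-wise sum of two equally sized vectors.
#pragma omp declare reduction(vec_plus : std::vector<double> : \
        std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), \
                       omp_out.begin(), std::plus<double>())) \
    initializer(omp_priv = std::vector<double>(omp_orig.size(), 0.0))

// Hypothetical class standing in for the question's ShapeMatching.
class Accumulator {
public:
    explicit Accumulator(std::size_t n) : velocity(n, 0.0) {}

    void add(const std::vector<std::vector<int>>& clusters)
    {
        // A local variable may appear in the reduction clause; the member may not.
        std::vector<double> total(velocity.size(), 0.0);
        const long n = static_cast<long>(clusters.size());

        #pragma omp parallel for reduction(vec_plus : total)
        for (long i = 0; i < n; ++i)
            for (int j : clusters[i])
                total[j] += 1.0;   // placeholder for f(j)

        // Serial step: fold the reduced result into the class member.
        for (std::size_t i = 0; i < velocity.size(); ++i)
            velocity[i] += total[i];
    }

    std::vector<double> velocity;
};
```

Each thread still gets a full private copy of the vector, so the memory cost is the same as with the hand-written thread-local version.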

2015-06-19 Micha

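Share the code that produces the error with reduction, so that a fix can be suggested. – Jeff

@Jeff Done. [enough characters] – Micha

Have you considered PPL? Writing code for "self-reducing data" is flexible there, and the data need not be primitives. Basically, you describe what the thread-local data is and how to combine two pieces of thread-local data, and the rest is done for you. – Yakk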

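You are doing an array reduction. I have described this several times (e.g. reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with or without a critical section.

You have already done it correctly with a critical section (in your recent edit), so let me describe how to do it without a critical section.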

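std::vector<Eigen::Vector3d> velocitya;  // one slab of velocity.size() elements per thread
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();

    // One thread allocates; the implicit barrier at the end of single
    // ensures the buffer is resized before any thread uses it.
    #pragma omp single
    velocitya.resize(vsize*nthreads);

    // Each thread zeroes its own slab.
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    // Each thread accumulates into its own slab, so no synchronization is needed.
    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }

    // The implicit barrier above guarantees all slabs are complete.
    // The merge is split by element index, so each velocity[i] has one writer.
    #pragma omp for schedule(static)
    for (int i = 0; i < vsize; i++) {
        for (int t = 0; t < nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}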

This method requires extra care/tuning because of false sharing, which I have not done.
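One possible piece of that tuning, as a sketch only: pad each thread's slab so that the used parts of two neighbouring slabs never share a cache line. padded_stride and kCacheLineBytes are hypothetical names, and the 64-byte line size is an assumption that depends on the hardware.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache-line size in bytes; 64 is common but hardware dependent.
constexpr std::size_t kCacheLineBytes = 64;

// Number of elements between the starts of consecutive per-thread slabs.
// At least one full cache line of padding follows the vsize used elements,
// so two slabs' used regions cannot share a line, whatever the base alignment.
inline std::size_t padded_stride(std::size_t vsize, std::size_t elem_bytes)
{
    assert(elem_bytes > 0);
    const std::size_t pad = (kCacheLineBytes + elem_bytes - 1) / elem_bytes;  // ceil
    return vsize + pad;
}
```

With this, the buffer would be resized to padded_stride(vsize, sizeof(Eigen::Vector3d))*nthreads and indexed as ithread*stride + j instead of ithread*vsize + j. It only addresses the slab boundaries; whether it helps in practice has to be measured.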

As for which method is better, you will have to test.

2015-06-23 12:33:25
