Different results from tbb::parallel_reduce and std::accumulate

I am learning Intel's TBB library. When summing all values in a std::vector, the result of tbb::parallel_reduce differs from that of std::accumulate once the vector holds more than 16,777,220 elements (the error shows up at 16,777,320 elements). Here is my minimal working example:

#include <iostream> 
#include <vector> 
#include <numeric> 
#include <limits> 
#include "tbb/tbb.h" 

int main(int argc, const char * argv[]) { 

    int count = std::numeric_limits<int>::max() * 0.0079 - 187800; // - 187900 works 

    std::vector<float> heights(count);
    std::fill(heights.begin(), heights.end(), 1.0f); 

    float ssum = std::accumulate(heights.begin(), heights.end(), 0); 
    float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0, 
             [](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) { 
              return std::accumulate(range.begin(), range.end(), init); 
             }, std::plus<float>() 
            ); 

    std::cout << std::endl << " Heights serial sum: " << ssum << " parallel sum: " << psum; 
    return 0; 
}

On my OS X 10.10.3 with Xcode 6.3.1 and TBB stable 4.3-20141023 (installed from Brew), this prints:

Heights serial sum: 1.67772e+07 parallel sum: 1.67773e+07

Why is that? Should I report a bug to the TBB developers?

Additional tests, applying your answers:

correct value is: 1949700403 
cause we add 1.0f to zero 1949700403 times 

using (int) init values: 
Runtime: 17.407 sec. Heights serial sum: 16777216.000, wrong 
Runtime: 8.482 sec. Heights parallel sum: 131127368.000, wrong 

using (float) init values: 
Runtime: 12.594 sec. Heights serial sum: 16777216.000, wrong 
Runtime: 5.044 sec. Heights parallel sum: 303073632.000, wrong 

using (double) initial values: 
Runtime: 13.671 sec. Heights serial sum: 1949700352.000, wrong 
Runtime: 5.343 sec. Heights parallel sum: 263690016.000, wrong 

using (double) initial values and tbb::parallel_deterministic_reduce: 
Runtime: 13.463 sec. Heights serial sum: 1949700352.000, wrong 
Runtime: 99.031 sec. Heights parallel sum: 1949700352.000, wrong >>> almost 10x slower !

Why do all the reduce calls produce a wrong sum? Is (double) not enough? Here is my test code:

    #include <iostream>
    #include <vector> 
    #include <numeric> 
    #include <limits> 
    #include <sys/time.h> 
    #include <iomanip> 
    #include "tbb/tbb.h" 
    #include <cmath> 

    class StopWatch { 
    private: 
     double elapsedTime; 
     timeval startTime, endTime; 
    public: 
     StopWatch() : elapsedTime(0) {} 
     void startTimer() { 
      elapsedTime = 0; 
      gettimeofday(&startTime, 0); 
     } 
     void stopNprintTimer() { 
      gettimeofday(&endTime, 0); 
      elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0;    // compute sec to ms 
      elapsedTime += (endTime.tv_usec - startTime.tv_usec)/1000.0;   // compute us to ms and add 
      std::cout << " Runtime: " << std::right << std::setw(6) << elapsedTime/1000 << " sec.";    // show in sec 
     } 
    }; 

    int main(int argc, const char * argv[]) { 

     StopWatch watch; 
     std::cout << std::fixed << std::setprecision(3) << "" << std::endl; 
     size_t count = std::numeric_limits<int>::max() * 0.9079; 

     std::vector<float> heights(count); 
     std::cout << " Vector size: " << count << std::endl; 
     std::fill(heights.begin(), heights.end(), 1.0f); 

     watch.startTimer(); 
     float ssum = std::accumulate(heights.begin(), heights.end(), 0.0); // change type of initial value here 
     watch.stopNprintTimer(); 
     std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl; 

     watch.startTimer(); 
     float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0.0, // change type of initial value here 
              [](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) { 
               return std::accumulate(range.begin(), range.end(), init); 
              }, std::plus<float>() 
             ); 
     watch.stopNprintTimer(); 
     std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl; 

     return 0; 
    }

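To answer my last question: they all produce wrong results because floating-point types are not made for integer addition with numbers this large. Switching to int solved it: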

[...] 
std::vector<int> heights(count); 
std::cout << " Vector size: " << count << std::endl; 
std::fill(heights.begin(), heights.end(), 1); 

watch.startTimer(); 
int ssum = std::accumulate(heights.begin(), heights.end(), (int)0); 
watch.stopNprintTimer(); 
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl; 

watch.startTimer(); 
int psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<int>::iterator>(heights.begin(), heights.end()), (int)0, 
            [](tbb::blocked_range<std::vector<int>::iterator> const& range, int init) { 
             return std::accumulate(range.begin(), range.end(), init); 
            }, std::plus<int>() 
           ); 
watch.stopNprintTimer(); 
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl; 
[...]

Result:

Vector size: 1949700403 
Runtime: 13.041 sec. Heights serial sum: 1949700403, correct 
Runtime: 4.728 sec. Heights parallel sum: 1949700403, correct and almost 4x faster


2015-05-05 VisorZ

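Floating-point arithmetic is not real-number arithmetic. If you change the order of operations, you may get different rounding errors. – DanielKO

Your call to std::accumulate is doing integer addition and then converting the result to float at the end of the computation. To accumulate floating-point numbers, the accumulator should be a float*.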

float ssum = std::accumulate(heights.begin(), heights.end(), 0.0f);
                                                             ^^^^

* Or any other type that can correctly accumulate float.


2015-05-05 12:45:41 juanchopanza

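Thanks. Going by the std::accumulate template syntax, I used an int value here, and it was only by luck that I had filled my vector with 1.0f, which is 1 when converted to int. With float values the result is still incorrect, but this time because of the inaccuracy of the float data type at higher magnitudes. – VisorZ

This might solve that part of the problem for you:

Your call to std::accumulate is doing integer addition and then converting the result to float at the end of the computation.

But floating-point addition is not an associative operation:

With accumulate: (...((s + a1) + a2) + ...) + an
With parallel_reduce: any arrangement of the parentheses is possible.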

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html


2015-05-05 14:04:09

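Thanks for pointing out the float precision issue and for linking to that great document (upvoted). – VisorZ

In addition to the other correct answers to the 'why?' part, I would add that TBB provides parallel_deterministic_reduce, which guarantees reproducible results across two or more runs on the same data (though the result can still differ from that of std::accumulate). See the blog describing the problem and the deterministic algorithm.

So, regarding the 'Should I report a bug to the TBB developers?' part, the answer is clearly no (unless you find a shortcoming on the TBB side).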


2015-05-05 14:45:24 Anton

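Thanks for the hint. Unfortunately, on my 4-thread Intel i7 it takes more time than the serial std::accumulate() with a double initial value. – VisorZ

Having read the link carefully, I now understand that 'tbb::parallel_deterministic_reduce' does not produce the correct result, but at least the wrong result is reproducible, meaning every run yields the same error. To quote: _It is important to note that the reproducible results obtained with parallel_deterministic_reduce may still differ from those obtained by serial execution. [..] Moreover, the algorithm is not intended to improve the accuracy of the computation._ – VisorZ

It is not an error. Or you could say the same of std::accumulate. – Anton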
