爲什麼我不是分支預測的受害者？

我正在寫一個函數來創建一個高斯過濾器（使用犰狳庫），它可以是2D或3D，具體取決於它接收到的輸入的維數。下面是代碼：爲什麼我不是分支預測的受害者？

template <class ty> 
ty gaussianFilter(const ty& input, double sigma) 
{ 
    // Our filter will be initialized to the same size as our input. 
    ty filter = ty(input); // Copy constructor. 

    uword nRows = filter.n_rows; 
    uword nCols = filter.n_cols; 
    uword nSlic = filter.n_elem/(nRows*nCols); // If 2D, nSlic == 1. 

    // Offsets with respect to the middle. 
    double rowOffset = static_cast<double>(nRows/2); 
    double colOffset = static_cast<double>(nCols/2); 
    double sliceOffset = static_cast<double>(nSlic/2); 

    // Counters. 
    double x = 0 , y = 0, z = 0; 

for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) { 
     x = static_cast<double>(rowIndex) - rowOffset; 
     for (uword colIndex = 0; colIndex < nCols; colIndex++) { 
     y = static_cast<double>(colIndex) - colOffset; 
     for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) { 
      z = static_cast<double>(sliIndex) - sliceOffset; 
      // If-statement inside for-loop looks terribly inefficient 
      // but the compiler should take care of this. 
      if (nSlic == 1){ // If 2D, Gauss filter for 2D. 
      filter(rowIndex*nCols + colIndex) = ... 
      } 
      else 
      { // Gauss filter for 3D. 
      filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ... 
      } 
     }  
    } 
}

正如我們看到的，有一個if語句最內環路內，檢查如果第三尺寸（nSlic）的大小等於1。一旦在開始時計算的函數，nSlic不會改變它的值，所以編譯器應該足夠聰明來優化條件分支，而且我不應該失去任何性能。

但是...如果我從循環中刪除if語句，我得到的性能提升。

if (nSlic == 1) 
    { // Gauss filter for 2D. 
    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) { 
     x = static_cast<double>(rowIndex) - rowOffset; 
     for (uword colIndex = 0; colIndex < nCols; colIndex++) { 
     y = static_cast<double>(colIndex) - colOffset; 
     for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) { 
      z = static_cast<double>(sliIndex) - sliceOffset; 
      {filter(rowIndex*nCols + colIndex) = ... 
     } 
     } 
    } 
    } 
else 
    { 
    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) { 
     x = static_cast<double>(rowIndex) - rowOffset; 
     for (uword colIndex = 0; colIndex < nCols; colIndex++) { 
     y = static_cast<double>(colIndex) - colOffset; 
     for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) { 
      z = static_cast<double>(sliIndex) - sliceOffset; 
      {filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ...          
     } 
     } 
    } 
    }

與g++ -O3 -c -o main.o main.cpp編譯和同時測量代碼變化的執行時間後，我得到了以下：
（1000重複，大小2048的2D矩陣）

如果-內：

66.0453秒
64.7701秒

如果-外：

64.0148秒
63.6808秒

爲什麼不編譯器優化分支如果nSlic的價值也不會改變？我必須重構代碼以避免for -loop中的if -statement？

來源

2015-12-14 dangom

我很困惑你在問什麼。您將if語句移出嵌套循環並驚訝於您的代碼運行速度更快？你期望編譯器將你的第一版代碼轉換爲第二版嗎？ – Kevin

我相信，如果'if'語句總會產生相同的結果，編譯器會優化它。我的假設來自[排序與未排序數組]（http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array）。我想了解爲什麼不是這種情況，以及何時可以期待這樣的編譯器優化。 – dangom

哦，我明白了。這不是編譯器的工作。處理器處理分支預測。 – Kevin

具有循環一個額外的變量將影響寄存器使用，這可能會影響時序，即使分支預測是否工作正常。您需要查看生成的程序集才能知道。它也可能會影響很難檢測到的緩存命中率。

來源

2015-12-14 23:58:16

你的錯誤是在這裏：

優化條件分支，我不應該失去任何性能

分支預測可以幫助你很多，比較實際執行與未知分支關聯的管道失速。但它仍然是一個額外的指令，仍然有成本。處理器魔術減少了無用代碼的成本......大大降低但不爲零。

來源

2015-12-14 23:47:55

編譯器和硬件之間的相互作用是這樣的 - 編譯器可能能夠優化分支走，使自身優化的代碼，但你可以看到，產生了大量的代碼膨脹，因爲它有效地複製整個循環。有些編譯器默認情況下可能包含此優化，而其他編譯器可能會要求您明確要求您完成。

或者，如果編譯器避免了這種優化，那麼代碼將保留該分支，並且HW將盡可能最好地預測它。這涉及複雜的分支預測器，其具有有限的表格，並因此限制了他們可以達到的學習量。在這個例子中，你沒有太多的競爭分支（循環，函數調用和返回，以及如果我們正在討論），但我們沒有看到調用函數的內部工作，它可能有更多的分支指令（沖淡你在外面學到的東西），或者它可能足以沖刷預測變量可能使用的任何全局歷史。沒有看到代碼很難說，也不知道你的分支預測器究竟做了什麼（這取決於你使用的CPU版本）。

還有一點需要注意 - 它可能不一定與分支預測有關，改變這樣的代碼可能會改變代碼緩存中的對齊或某些用於優化循環的內部循環緩衝區（如this），這可能會導致劇烈變化在表現。要知道的唯一方法是根據HW計數器（perf，vtune等）運行一些分析，並測量分支數量和誤預測的變化。

來源

2015-12-23 00:42:07 Leeor

爲什麼我不是分支預測的受害者？

回答

相關問題