SQRT vs RSQRT vs SSE _mm_rsqrt_ps Benchmark

I haven't found any clear benchmark on this subject, so I made one. I'm posting it here in case anyone else is looking for the same thing.

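I do have one question, though. Shouldn't SSE be 4 times faster than four FPU RSQRTs in a loop? It is faster, but only by about 1.5x. Does moving data into SSE registers have that much impact, given that I do very little computation apart from the rsqrt? Or is it because the SSE rsqrt is more precise? How can I find out how many iterations the SSE rsqrt performs? The two results: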

4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07 
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
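On the precision part of the question: as far as the documentation goes, the SSE instruction is a single hardware approximation with a specified error bound rather than an iterative routine whose steps could be counted. Intel documents the maximum relative error of RSQRTPS as less than 1.5 * 2^-12. A rough way to see what a particular CPU delivers is to compare against a double-precision reference. The sketch below assumes an x86 compiler that ships <xmmintrin.h>; the function name max_rsqrt_ps_error is made up for illustration and is not part of the benchmark.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_rsqrt_ps, _mm_loadu_ps, _mm_storeu_ps
#include <algorithm>
#include <cassert>
#include <cmath>

// Largest relative error of _mm_rsqrt_ps against 1/sqrt(x) computed in double
// precision, sampled at x = 1, 2, ..., n in groups of four.
// (Hypothetical helper for illustration only.)
double max_rsqrt_ps_error(int n)
{
    assert(n >= 4);
    double worst = 0.0;
    for (int i = 1; i + 3 <= n; i += 4) {
        const float in[4] = { float(i), float(i + 1), float(i + 2), float(i + 3) };
        float out[4];
        // Unaligned load/store keeps the sketch free of alignment requirements.
        _mm_storeu_ps(out, _mm_rsqrt_ps(_mm_loadu_ps(in)));
        for (int k = 0; k < 4; ++k) {
            const double exact = 1.0 / std::sqrt(double(in[k]));
            worst = std::max(worst, std::fabs(out[k] - exact) / exact);
        }
    }
    return worst;
}
```

Calling max_rsqrt_ps_error(5000000) covers the same input range as the benchmark loops. The value it returns depends on the CPU, so it is best read as a measurement rather than a constant.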

Edit

Compiled with MSVC 11 using /GS- /Gy /fp:fast /arch:SSE2 /Ox /Oy- /GL /Oi on an AMD Athlon II X2 270.

Test code, using the float type:

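#include <iostream> 
#include <chrono> 
#include <cmath>          // sqrt 
#include <cstdlib>        // system 
#include <xmmintrin.h>    // SSE intrinsics (_mm_rsqrt_ss, _mm_rsqrt_ps, ...) 
#include <th/thutility.h> // my own header; provides thutility::math::rsqrt 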

int main(void) 
{ 
    float i; 
    //long i; 
    float res; 
    __declspec(align(16)) float var[4] = {0}; 

    auto t1 = std::chrono::high_resolution_clock::now(); 
    for(i = 0; i < 5000000; i+=1) 
     res = sqrt(i); 
    auto t2 = std::chrono::high_resolution_clock::now(); 
    std::cout << "1 float SQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl; 

    t1 = std::chrono::high_resolution_clock::now(); 
    for(i = 0; i < 5000000; i+=1) 
    { 
     thutility::math::rsqrt(i, res); 
     res *= i; 
    } 
    t2 = std::chrono::high_resolution_clock::now(); 
    std::cout << "1 float RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl; 

    t1 = std::chrono::high_resolution_clock::now(); 
    for(i = 0; i < 5000000; i+=1) 
    { 
     thutility::math::rsqrt(i, var[0]); 
     var[0] *= i; 
    } 
    t2 = std::chrono::high_resolution_clock::now(); 
    std::cout << "1 align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << std::endl; 

    t1 = std::chrono::high_resolution_clock::now(); 
    for(i = 0; i < 5000000; i+=1) 
    { 
     thutility::math::rsqrt(i, var[0]); 
     var[0] *= i; 
     thutility::math::rsqrt(i, var[1]); 
     var[1] *= i + 1; 
     thutility::math::rsqrt(i, var[2]); 
     var[2] *= i + 2; 
    } 
    t2 = std::chrono::high_resolution_clock::now(); 
    std::cout << "3 align16 float[4] RSQRT: " 
     << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " 
     << var[0] << " - " << var[1] << " - " << var[2] << std::endl; 

    t1 = std::chrono::high_resolution_clock::now(); 
    for(i = 0; i < 5000000; i+=1) 
    { 
     thutility::math::rsqrt(i, var[0]); 
     var[0] *= i; 
     thutility::math::rsqrt(i, var[1]); 
     var[1] *= i + 1; 
     thutility::math::rsqrt(i, var[2]); 
     var[2] *= i + 2; 
     thutility::math::rsqrt(i, var[3]); 
     var[3] *= i + 3; 
    } 
    t2 = std::chrono::high_resolution_clock::now(); 
    std::cout << "4 align16 float[4] RSQRT: " 
     << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " 
     << var[0] << " - " << var[1] << " - " << var[2] << " - " << var[3] << std::endl; 

    t1 = std::chrono::high_resolution_clock::now(); 
    for(i = 0; i < 5000000; i+=1) 
    { 
     var[0] = i; 
     __m128& cache = reinterpret_cast<__m128&>(var); 
     __m128 mmsqrt = _mm_rsqrt_ss(cache); 
     cache = _mm_mul_ss(cache, mmsqrt); 
    } 
    t2 = std::chrono::high_resolution_clock::now(); 
    std::cout << "1 SSE align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() 
     << "us " << var[0] << std::endl; 

    t1 = std::chrono::high_resolution_clock::now(); 
    for(i = 0; i < 5000000; i+=1) 
    { 
     var[0] = i; 
     var[1] = i + 1; 
     var[2] = i + 2; 
     var[3] = i + 3; 
     __m128& cache = reinterpret_cast<__m128&>(var); 
     __m128 mmsqrt = _mm_rsqrt_ps(cache); 
     cache = _mm_mul_ps(cache, mmsqrt); 
    } 
    t2 = std::chrono::high_resolution_clock::now(); 
    std::cout << "4 SSE align16 float[4] RSQRT: " 
     << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << " - " 
     << var[1] << " - " << var[2] << " - " << var[3] << std::endl; 

    system("PAUSE"); 
}

Results:

1 float SQRT: 24996us 2236.07 
1 float RSQRT: 28003us 2236.07 
1 align16 float[4] RSQRT: 32004us 2236.07 
3 align16 float[4] RSQRT: 51013us 2236.07 - 2236.07 - 5e+006 
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07 
1 SSE align16 float[4] RSQRT: 46999us 2236.07 
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07

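My conclusion is that it is not worth bothering with SSE2 unless we compute on no fewer than 4 variables. (Maybe this applies only to rsqrt here, but it is an expensive calculation that also includes several multiplications, so it probably applies to other calculations too.)

Also, sqrt(x) is faster than x * rsqrt(x) with two iterations, and x * rsqrt(x) with one iteration is too inaccurate for distance calculation.

So the claims I have seen on some boards that x * rsqrt(x) is faster than sqrt(x) are wrong. It is not logical, and not worth the loss of precision, to use rsqrt instead of sqrt unless you directly need 1/x^(1/2).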
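For the case where 1/x^(1/2) itself is what you need, the usual pattern is to take the hardware estimate and refine it with one Newton-Raphson step, y1 = y0 * (1.5 - 0.5 * x * y0 * y0). The following is only a sketch under assumptions: the names rsqrt_refined and normalize3 are invented for illustration, and vector normalization is just one example of a use that wants the reciprocal directly.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>

// 1/sqrt(x) per lane: _mm_rsqrt_ps estimate plus one Newton-Raphson step,
// y1 = y0 * (1.5 - 0.5 * x * y0 * y0). (Hypothetical helper.)
__m128 rsqrt_refined(__m128 x)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    const __m128 y0 = _mm_rsqrt_ps(x);
    const __m128 x_y0_y0 = _mm_mul_ps(x, _mm_mul_ps(y0, y0));
    return _mm_mul_ps(y0, _mm_sub_ps(three_halves, _mm_mul_ps(half, x_y0_y0)));
}

// Scales a 3D vector to unit length by multiplying with 1/sqrt(|v|^2),
// so no division and no separate sqrt are needed. (Hypothetical helper.)
void normalize3(float v[3])
{
    const float len2 = v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
    assert(len2 > 0.0f);  // the zero vector cannot be normalized
    // Broadcast to all lanes so no lane sees 0 (rsqrt(0) is +inf).
    const float inv_len = _mm_cvtss_f32(rsqrt_refined(_mm_set1_ps(len2)));
    v[0] *= inv_len;
    v[1] *= inv_len;
    v[2] *= inv_len;
}
```

The refinement costs a few extra multiplies, so whether it still beats a plain 1.0f / sqrt(x) on a given CPU is something to measure rather than assume.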

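I also tried without the SSE2 flag (in case the compiler was applying SSE to the normal rsqrt loop) and got the same results.

My RSQRT is a modified (essentially the same) version of the Quake rsqrt.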

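#include <cstdint> // uint32_t 

namespace thutility 
{ 
    namespace math 
    { 
        // Approximates 1/sqrt(number) and writes it to res. 
        // inline so the definition can live in a header without ODR problems. 
        inline void rsqrt(const float& number, float& res) 
        { 
            const float threehalfs = 1.5F; 
            const float x2 = number * 0.5F; 

            res = number; 
            // Note: this pointer cast technically violates strict aliasing. 
            uint32_t& i = *reinterpret_cast<uint32_t *>(&res); // evil floating point bit level hacking 
            i = 0x5f3759df - (i >> 1);                   // what the fuck? 
            res = res * (threehalfs - (x2 * res * res)); // 1st iteration 
            res = res * (threehalfs - (x2 * res * res)); // 2nd iteration, this can be removed 
        } 
    } 
}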
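To put a number on "too inaccurate" for one iteration versus two, the approximation can be compared against a double-precision sqrt. This is a sketch with assumptions, not the benchmark code: quake_rsqrt and max_sqrt_error are illustrative names, the iteration count is a parameter instead of being hard-coded, and std::memcpy stands in for the pointer cast to avoid the aliasing issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Quake-style estimate of 1/sqrt(x), followed by `iterations` Newton-Raphson
// steps. (Hypothetical variant of the rsqrt above, for measurement only.)
float quake_rsqrt(float x, int iterations)
{
    assert(iterations >= 0);
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // read the float's bit pattern
    bits = 0x5f3759df - (bits >> 1);       // initial guess from the bit pattern
    float y;
    std::memcpy(&y, &bits, sizeof y);
    const float half_x = 0.5f * x;
    for (int k = 0; k < iterations; ++k)
        y = y * (1.5f - half_x * y * y);   // y <- y * (3/2 - x/2 * y^2)
    return y;
}

// Largest relative error of x * quake_rsqrt(x) against sqrt(x), x = 1..n.
double max_sqrt_error(int n, int iterations)
{
    double worst = 0.0;
    for (int i = 1; i <= n; ++i) {
        const float x = float(i);
        const float approx = x * quake_rsqrt(x, iterations);
        const double exact = std::sqrt(double(x));
        worst = std::max(worst, std::fabs(approx - exact) / exact);
    }
    return worst;
}
```

Comparing max_sqrt_error(n, 1) with max_sqrt_error(n, 2) shows how much the second step buys. The figure commonly cited for a single step is a worst-case relative error of a bit under 0.2%, which over a distance of a few thousand units amounts to several units.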

2013-03-02 Etherealone

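Er, what are you trying to compare? I see square roots and reciprocal square roots, hand-written approximations, scalar SSE instructions, SIMD SSE instructions, and standard library implementations. Which of those are you trying to compare, and which results surprise you? – jalf 2013-03-02 14:45:43

What surprised me was the hand-coded rsqrt approximation executed 4 times per loop iteration. Shouldn't it be 4 times slower than SSE2? I also noticed that my SSE results are wrong. Why is that? – Etherealone 2013-03-02 14:50:33

It looks like you are calling the scalar SSE instruction in every case ('_mm_rsqrt_ss' rather than '_mm_rsqrt_ps'). Am I missing something? – jalf 2013-03-02 14:51:08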

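It is easy to end up with a lot of unnecessary overhead in SSE code.

If you want to make sure your code is efficient, look at the compiler's disassembly. One thing that often kills performance (and it looks like it may be affecting you) is moving data between memory and SSE registers unnecessarily.

Inside your loop, you should keep all the relevant data, as well as the result, in SSE registers rather than in a float[4].

Wherever you do access memory, verify that the compiler generates aligned move instructions to load the data into registers or write it back to the array.

Then check that the generated SSE instructions don't have a lot of unnecessary moves and other cruft in between them. Some compilers are pretty terrible at generating SSE code from intrinsics, so it pays to keep a close eye on the code they produce.

Finally, consult your CPU's manual/specs to make sure it actually executes the packed instructions you are using as fast as the scalar ones. (I believe modern CPUs do, but at least some older CPUs needed a bit of extra time for packed instructions. Not four times as long as the scalar instructions, but enough that you couldn't reach a 4x speedup.)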
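As a rough illustration of keeping everything in registers, the loop below holds the inputs and the running result in __m128 locals and touches memory only once, after the loop. It is a sketch under assumptions rather than a drop-in replacement for the benchmark: the name packed_sqrt_sum is invented, the lanes start at 1 instead of 0 (so that 0 * rsqrt(0) never produces NaN), and the results are summed so the compiler cannot discard all but the last iteration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstdint>

// Sums x * rsqrt(x) (an estimate of sqrt(x)) over `iterations` steps, four
// lanes at a time. Lanes start at {1, 2, 3, 4} and each advances by 1 per step.
// (Hypothetical helper for illustration only.)
void packed_sqrt_sum(int iterations, float* out4)
{
    // _mm_store_ps requires a 16-byte aligned destination.
    assert(reinterpret_cast<std::uintptr_t>(out4) % 16 == 0);

    __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // lane 0 = 1 ... lane 3 = 4
    const __m128 step = _mm_set1_ps(1.0f);
    __m128 sum = _mm_setzero_ps();

    for (int n = 0; n < iterations; ++n) {
        // No float[4] is read or written inside the loop.
        sum = _mm_add_ps(sum, _mm_mul_ps(x, _mm_rsqrt_ps(x)));
        x = _mm_add_ps(x, step);
    }
    _mm_store_ps(out4, sum);  // single aligned store after the loop
}
```

The destination could be the same __declspec(align(16)) float var[4] used in the benchmark. Whether the compiler really keeps x and sum in registers still has to be confirmed in the disassembly, as described above.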

2013-03-02 15:02:49 jalf

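Is there an intrinsic that tells the compiler to keep a variable in a register, or do I have to write inline assembly for that? – Etherealone 2013-03-02 15:05:58

@Tolga Use variables of type '__m128'. That doesn't guarantee the variable will be kept in a register, but it is likely (with 'float[4]' it is unlikely). – harold 2013-03-02 17:42:47

Yep. And don't put them in a union with a 'float[4]' either. Don't let anything alias them. Ideally, declare the variable as close to its point of use as possible, and don't use it for anything else afterwards. The smaller its scope, the easier it is for the compiler to determine that it is not aliased and doesn't need to be written out to memory. – jalf 2013-03-02 21:33:57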

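My conclusion is that it is not worth bothering with SSE2 unless we compute on no fewer than 4 variables. (Maybe this applies only to rsqrt here, but it is an expensive calculation that also includes several multiplications, so it probably applies to other calculations too.)

Also, sqrt(x) is faster than x * rsqrt(x) with two iterations, and x * rsqrt(x) with one iteration is too inaccurate for distance calculation.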

2013-03-02 15:02:15 Etherealone

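What you have just discovered is that cargo-cult programming rarely works. Knowing that Carmack used a clever approximation 10 years ago, because CPUs at the time were deficient in certain areas, doesn't magically make your code faster if you use the same trick today. :) – jalf 2013-03-02 21:30:03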
