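SQRT vs RSQRT vs SSE _mm_rsqrt_ps Benchmark
I haven't found any clear benchmark on this subject, so I made one. I'm posting it here in case anyone else is looking for the same thing.
I do have one question, though. Shouldn't SSE be 4 times faster than four FPU RSQRT calls in a loop? It is faster, but only by about 1.5 times. Does moving data into the SSE registers have this much impact because I do very little computation apart from the rsqrt? Or is it because the SSE rsqrt is much more precise? How can I find out how many iterations the SSE rsqrt performs? The two results: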
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
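One way to approach the precision part of the question empirically is to measure the relative error of `_mm_rsqrt_ps` against a double-precision reference. The instruction is a hardware approximation rather than an iterative routine (Intel documents a relative error bound of 1.5 * 2^-12), so there is no iteration count to query; any refinement has to be added by hand. The sketch below assumes an x86 target with SSE. The helper names `refine_rsqrt` and `max_rel_error_rsqrt_ps` are made up for illustration and are not part of the benchmark.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>

// One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
// Hypothetical helper, not part of the original benchmark.
inline __m128 refine_rsqrt(__m128 x, __m128 y)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    const __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}

// Largest relative error of the rsqrt estimate over x = 1, 2, ..., count.
// count must be a positive multiple of 4. With newton == true, one
// refinement step is applied before the error is measured.
inline double max_rel_error_rsqrt_ps(int count, bool newton)
{
    assert(count > 0 && count % 4 == 0);
    double worst = 0.0;
    for (int base = 1; base <= count; base += 4) {
        alignas(16) float in[4];
        alignas(16) float out[4];
        for (int k = 0; k < 4; ++k)
            in[k] = static_cast<float>(base + k);
        const __m128 x = _mm_load_ps(in);
        __m128 y = _mm_rsqrt_ps(x);          // hardware estimate of 1/sqrt(x)
        if (newton)
            y = refine_rsqrt(x, y);
        _mm_store_ps(out, y);
        for (int k = 0; k < 4; ++k) {
            const double exact = 1.0 / std::sqrt(static_cast<double>(in[k]));
            const double err = std::fabs(out[k] - exact) / exact;
            if (err > worst)
                worst = err;
        }
    }
    return worst;
}
```

Calling `max_rel_error_rsqrt_ps(4096, false)` and `max_rel_error_rsqrt_ps(4096, true)` gives a rough picture of the raw estimate versus one manual refinement step; the exact figures vary by CPU.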
Edit
Compiled with MSVC 11 using /GS- /Gy /fp:fast /arch:SSE2 /Ox /Oy- /GL /Oi on an AMD Athlon II X2 270.
Test code (using the float type):
#include <iostream>
#include <chrono>
#include <cmath>        // sqrt
#include <cstdlib>      // system
#include <xmmintrin.h>  // __m128, _mm_rsqrt_ss, _mm_rsqrt_ps, _mm_mul_ss, _mm_mul_ps
#include <th/thutility.h>

int main(void)
{
    float i;
    //long i;
    float res;
    __declspec(align(16)) float var[4] = {0};

    // 1 float SQRT: standard library sqrt
    auto t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
        res = sqrt(i);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 float SQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;

    // 1 float RSQRT: x * rsqrt(x) with the hand-written approximation
    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        thutility::math::rsqrt(i, res);
        res *= i;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 float RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;

    // Same as above, but writing into the 16-byte aligned array
    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        thutility::math::rsqrt(i, var[0]);
        var[0] *= i;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << std::endl;

    // Three hand-written rsqrt calls per iteration
    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        thutility::math::rsqrt(i, var[0]);
        var[0] *= i;
        thutility::math::rsqrt(i, var[1]);
        var[1] *= i + 1; // note: this is rsqrt(i) * (i + 1), not sqrt(i + 1)
        thutility::math::rsqrt(i, var[2]);
        var[2] *= i + 2;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "3 align16 float[4] RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us "
        << var[0] << " - " << var[1] << " - " << var[2] << std::endl;

    // Four hand-written rsqrt calls per iteration
    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        thutility::math::rsqrt(i, var[0]);
        var[0] *= i;
        thutility::math::rsqrt(i, var[1]);
        var[1] *= i + 1;
        thutility::math::rsqrt(i, var[2]);
        var[2] *= i + 2;
        thutility::math::rsqrt(i, var[3]);
        var[3] *= i + 3;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "4 align16 float[4] RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us "
        << var[0] << " - " << var[1] << " - " << var[2] << " - " << var[3] << std::endl;

    // Scalar SSE: _mm_rsqrt_ss works on the lowest lane only
    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        var[0] = i;
        __m128& cache = reinterpret_cast<__m128&>(var);
        __m128 mmsqrt = _mm_rsqrt_ss(cache);
        cache = _mm_mul_ss(cache, mmsqrt);
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 SSE align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()
        << "us " << var[0] << std::endl;

    // Packed SSE: _mm_rsqrt_ps works on all four lanes
    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        var[0] = i;
        var[1] = i + 1;
        var[2] = i + 2;
        var[3] = i + 3;
        __m128& cache = reinterpret_cast<__m128&>(var);
        __m128 mmsqrt = _mm_rsqrt_ps(cache);
        cache = _mm_mul_ps(cache, mmsqrt);
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "4 SSE align16 float[4] RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << " - "
        << var[1] << " - " << var[2] << " - " << var[3] << std::endl;

    system("PAUSE");
}
Results:
1 float SQRT: 24996us 2236.07
1 float RSQRT: 28003us 2236.07
1 align16 float[4] RSQRT: 32004us 2236.07
3 align16 float[4] RSQRT: 51013us 2236.07 - 2236.07 - 5e+006
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
1 SSE align16 float[4] RSQRT: 46999us 2236.07
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
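My conclusion is that SSE2 is not worth the bother unless we are computing on at least 4 variables. (Maybe this applies only to rsqrt here, but it is an expensive calculation (it also includes several multiplications), so it probably applies to other calculations too.)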
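One possible factor in the modest speedup is that the SSE loop in the test writes four scalars to memory and then reloads them as a single vector on every iteration. As a rough sketch of the case where SSE is more likely to pay off, the function below processes inputs that already sit in a contiguous, 16-byte aligned array, so each step is just load, rsqrt, multiply, store. The name `sqrt_via_rsqrt_ps` is hypothetical, and the inputs are assumed to be strictly positive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Approximates sqrt(in[k]) as in[k] * rsqrt(in[k]), four floats at a time.
// Assumptions: n is a multiple of 4, both pointers are 16-byte aligned,
// and every input is > 0 (0 * rsqrt(0) would be 0 * inf = NaN).
inline void sqrt_via_rsqrt_ps(const float* in, float* out, std::size_t n)
{
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    for (std::size_t k = 0; k < n; k += 4) {
        const __m128 x = _mm_load_ps(in + k);      // aligned load of 4 floats
        const __m128 r = _mm_rsqrt_ps(x);          // hardware estimate of 1/sqrt(x)
        _mm_store_ps(out + k, _mm_mul_ps(x, r));   // x * 1/sqrt(x) ~ sqrt(x)
    }
}
```

Timing this over a large array against a plain `sqrt` loop over the same data would be a fairer comparison than the single-vector loop, though the outcome still depends on the compiler and CPU.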
Also, sqrt(x) is faster than x * rsqrt(x) with two iterations, and x * rsqrt(x) with one iteration is too inaccurate for distance calculations.
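The accuracy side of this claim can be checked numerically. The sketch below is a stand-alone variant of the Quake-style routine with a selectable number of Newton iterations, plus a helper that reports the worst relative error of x * rsqrt(x) against a double-precision sqrt. The names `quake_rsqrt` and `max_rel_error_sqrt` are illustrative and not from the original code, and `std::memcpy` is substituted for the pointer cast to sidestep aliasing concerns.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Quake-style approximation of 1/sqrt(x) with a selectable number of
// Newton-Raphson steps. Hypothetical helper; x must be positive and finite.
inline float quake_rsqrt(float x, int iterations)
{
    assert(x > 0.0f && iterations >= 0);
    const float half_x = 0.5f * x;
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // reinterpret the float bits as an integer
    bits = 0x5f3759df - (bits >> 1);       // initial guess from the magic constant
    float y;
    std::memcpy(&y, &bits, sizeof y);
    for (int k = 0; k < iterations; ++k)
        y = y * (1.5f - half_x * y * y);   // one Newton-Raphson refinement
    return y;
}

// Worst relative error of x * quake_rsqrt(x) versus sqrt(x) for x = 1 .. count.
inline double max_rel_error_sqrt(int count, int iterations)
{
    double worst = 0.0;
    for (int n = 1; n <= count; ++n) {
        const float x = static_cast<float>(n);
        const double approx = static_cast<double>(x * quake_rsqrt(x, iterations));
        const double exact = std::sqrt(static_cast<double>(x));
        const double err = std::fabs(approx - exact) / exact;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Comparing `max_rel_error_sqrt(100000, 1)` with `max_rel_error_sqrt(100000, 2)` shows how much the second iteration buys: the error should drop from the order of 1e-3 to the order of 1e-6. Whether the one-iteration error is acceptable depends on the distance calculation at hand.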
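So the claims I have seen on some boards that x * rsqrt(x) is faster than sqrt(x) are false. It is therefore not logical, and not worth the precision loss, to use rsqrt instead of sqrt unless you directly need 1/x^(1/2).
I also tried without the SSE2 flag (in case the compiler was applying SSE to the normal rsqrt loop); it gave the same results.
My RSQRT is a modified (essentially the same) version of the Quake rsqrt: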
#include <cstdint>  // uint32_t

namespace thutility
{
    namespace math
    {
        // Quake-style fast reciprocal square root: writes an approximation of 1/sqrt(number) to res.
        void rsqrt(const float& number, float& res)
        {
            const float threehalfs = 1.5F;
            const float x2 = number * 0.5F;
            res = number;
            uint32_t& i = *reinterpret_cast<uint32_t *>(&res); // evil floating point bit level hacking
            i = 0x5f3759df - (i >> 1); // what the fuck?
            res = res * (threehalfs - (x2 * res * res)); // 1st iteration
            res = res * (threehalfs - (x2 * res * res)); // 2nd iteration, this can be removed
        }
    }
}
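Uh, what are you trying to compare? I see square root and reciprocal square root, hand-written approximations, scalar SSE instructions, SIMD SSE instructions, and standard library implementations. Which of these are you trying to compare, and which results surprise you? – jalf 2013-03-02 14:45:43
The surprising part for me is the hand-coded rsqrt approximation called 4 times per iteration in a loop. Shouldn't it be 4 times slower than SSE2? I also noticed that my SSE results are wrong. Why is that? – Etherealone 2013-03-02 14:50:33
It looks like you are calling the scalar SSE instruction in every case ('_mm_rsqrt_ss' rather than '_mm_rsqrt_ps'). Am I missing something? – jalf 2013-03-02 14:51:08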
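To illustrate the distinction raised in this comment: `_mm_rsqrt_ss` computes the estimate for the lowest lane only and copies the upper three lanes from its input unchanged, while `_mm_rsqrt_ps` estimates all four lanes. A minimal sketch, assuming an x86 target with SSE; the `Lanes` struct and the helper names are hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>

struct Lanes { float v[4]; };

// Scalar form: only lane 0 is estimated, lanes 1..3 pass through unchanged.
inline Lanes rsqrt_scalar_lane(const Lanes& in)
{
    Lanes out;
    _mm_storeu_ps(out.v, _mm_rsqrt_ss(_mm_loadu_ps(in.v)));
    return out;
}

// Packed form: all four lanes are estimated.
inline Lanes rsqrt_all_lanes(const Lanes& in)
{
    Lanes out;
    _mm_storeu_ps(out.v, _mm_rsqrt_ps(_mm_loadu_ps(in.v)));
    return out;
}

// Example check: with input {4, 16, 64, 256}, the scalar form leaves
// 16, 64 and 256 exactly as they were.
inline void check_lane_behaviour()
{
    const Lanes in = {{4.0f, 16.0f, 64.0f, 256.0f}};
    const Lanes s = rsqrt_scalar_lane(in);
    assert(s.v[1] == 16.0f && s.v[2] == 64.0f && s.v[3] == 256.0f);
}
```

With the same input, `rsqrt_all_lanes` should return values close to 0.5, 0.25, 0.125 and 0.0625, which is the behaviour the four-value loop relies on.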