2010-07-09 45 views
7

通過SSE指令執行復數乘法和除法是否有利? 我知道使用SSE時,加法和減法效果會更好。有人能告訴我如何使用SSE執行復雜的乘法以獲得更好的性能嗎?複雜Mul和Div使用Sse指令

回答

9

那麼複數乘法的定義是:

((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i 

所以在複數的兩種成分是

((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))i 

所以假設你使用8輛彩車代表定義了4個複數如下:

c1a, c1b, c2a, c2b 
c3a, c3b, c4a, c4b 

而你想同時做(c1 * c3)和(c2 * c4)你的SSE代碼看上去像下面這樣的「東西」:

(注意我在windows下使用MSVC,但原理相同)。

__declspec(align(16)) float c1c2[]  = { 1.0f, 2.0f, 3.0f, 4.0f }; 
__declspec(align(16)) float c3c4[]   = { 4.0f, 3.0f, 2.0f, 1.0f }; 
__declspec(align(16)) float mulfactors[] = { -1.0f, 1.0f, -1.0f, 1.0f }; 
__declspec(align(16)) float res[]   = { 0.0f, 0.0f, 0.0f, 0.0f }; 

__asm 
{ 
    movaps xmm0, xmmword ptr [c1c2]   // Load c1 and c2 into xmm0. 
    movaps xmm1, xmmword ptr [c3c4]   // Load c3 and c4 into xmm1. 
    movaps xmm4, xmmword ptr [mulfactors] // load multiplication factors into xmm4 

    movaps xmm2, xmm1      
    movaps xmm3, xmm0      
    shufps xmm2, xmm1, 0xA0     // Change order to c3a c3a c4a c4a and store in xmm2 
    shufps xmm1, xmm1, 0xF5     // Change order to c3b c3b c4b c4b and store in xmm1 
    shufps xmm3, xmm0, 0xB1     // change order to c1b c1a c2b c2a abd store in xmm3 

    mulps xmm0, xmm2       
    mulps xmm3, xmm1      
    mulps xmm3, xmm4      // Flip the signs of the 'a's so the add works correctly. 

    addps xmm0, xmm3      // Add together 

    movaps xmmword ptr [res], xmm0   // Store back out 
}; 

float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]); 
float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]); 

float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]); 
float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]); 

if (res1a != res[0] || 
    res1b != res[1] || 
    res2a != res[2] || 
    res2b != res[3]) 
{ 
    _exit(1); 
} 

我上面所做的是我簡化了一些數學。假設以下幾點:

c1a c1b c2a c2b 
c3a c3b c4a c4b 

通過重新排列我結束了以下載體

0 => c1a c1b c2a c2b 
1 => c3b c3b c4b c4b 
2 => c3a c3a c4a c4a 
3 => c1b c1a c2b c2a 

我然後乘以0和2合力得到:

0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a 

接着我乘3和1個一起得到:

3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b 

最後,我翻了幾個浮體的跡象,3

3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b 

這樣我就可以把它們相加,並得到

(c1a * c3a) - (c1b * c3b), (c1b * c3a) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b) 

這就是我們是:)

+0

參見HTTP://軟件.intel.com/file/1000,它似乎有更快的算法。 – MSalters 2010-07-09 12:04:14

+1

對我來說是類似的設置,但是他們需要SSE3 ...在我這個時代,99%的時間都可以,我承認。 – Goz 2010-07-09 13:00:59

+0

這增加了子指令看起來很方便。唉,我通常不以上述SSE2爲目標兼容性:( – Goz 2010-07-09 13:02:50

8

只爲後完整性,可下載的英特爾®64和IA-32體系結構優化參考手冊here包含用於複數乘法(例6-9)和複數除法(例6-10)的程序集。

下面是例如乘法的代碼:

// Multiplication of (ak + i bk) * (ck + i dk) 
// a + i b can be stored as a data structure 
movsldup xmm0, src1; load real parts into the destination, a1, a1, a0, a0 
movaps xmm1, src2; load the 2nd pair of complex values, i.e. d1, c1, d0, c0 
mulps xmm0, xmm1; temporary results, a1d1, a1c1, a0d0, a0c0 
shufps xmm1, xmm1, b1; reorder the real and imaginary parts, c1, d1, c0, d0 
movshdup xmm2, src1; load imaginary parts into the destination, b1, b1, b0, b0 
mulps xmm2, xmm1; temporary results, b1c1, b1d1, b0c0, b0d0 
addsubps xmm0, xmm2; b1c1+a1d1, a1c1 -b1d1, b0c0+a0d0, ; a0c0-b0d0 

的組件直接映射到gccs X86 intrinsics(只是謂詞每個指令與__builtin_ia32_)。

4

intel優化參考中的算法不能正確處理輸入中的溢出和NaN

單個NaN中的數字的實部或虛部將錯誤地傳播到其他部分。

由於幾個無限運算(例如無限* 0)在NaN處結束,溢出可能導致NaN出現在您的其他行爲良好的數據中。

如果溢出NaN s爲罕見的,一個簡單的方法來避免這種情況是在結果只檢查NaN並與編譯器重新計算它符合IEEE執行:

float complex a[2], b[2]; 
__m128 res = simd_fast_multiply(a, b); 

/* store unconditionally, can be executed in parallel with the check 
* making it almost free if there is no NaN in data */ 
_mm_store_ps(dest, res); 

/* check for NaN */ 
__m128 n = _mm_cmpneq_ps(res, res); 
int have_nan = _mm_movemask_ps(n); 
if (have_nan != 0) { 
    /* do it again unvectorized */ 
    dest[0] = a[0] * b[0]; 
    dest[1] = a[1] * b[1]; 
}