2014-12-06 48 views
1

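I am new to SSE intrinsics and would appreciate some hints on using them (this is still foggy to me). How can I rewrite this code with SSE intrinsics?

I have code like this: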

for(int k=0; k<=n-4; k+=4) 
{ 

    int xc0 = 512 + ((idx + k*iddx)>>6); 
    int yc0 = 512 + ((idy + k*iddy)>>6); 

    int xc1 = 512 + ((idx + (k+1)*iddx)>>6); 
    int yc1 = 512 + ((idy + (k+1)*iddy)>>6); 

    int xc2 = 512 + ((idx + (k+2)*iddx)>>6); 
    int yc2 = 512 + ((idy + (k+2)*iddy)>>6); 

    int xc3 = 512 + ((idx + (k+3)*iddx)>>6); 
    int yc3 = 512 + ((idy + (k+3)*iddy)>>6); 

    unsigned color0 = working_buffer[yc0*working_buffer_size_x + xc0]; 
    unsigned color1 = working_buffer[yc1*working_buffer_size_x + xc1]; 
    unsigned color2 = working_buffer[yc2*working_buffer_size_x + xc2]; 
    unsigned color3 = working_buffer[yc3*working_buffer_size_x + xc3]; 

    int adr = base_adr + k; 

    frame_bitmap[adr] = color0; 
    frame_bitmap[adr+1]= color1; 
    frame_bitmap[adr+2]= color2; 
    frame_bitmap[adr+3]= color3; 
} 

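Everything here is int/unsigned, and this is the critical part of the loop. I am not sure whether integer SSE would help with speed here, or whether it would work at all. Can someone help?

(I'm using mingw32)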

+0

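Can you de-obfuscate the actual access pattern into working_buffer? So, just the index math. It's a bit hard to decode. I'm still not sure whether this is a "weird gather" or a pattern that can be worked with. – harold 2014-12-06 17:32:03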

+0

This looks like a "gather" type operation, so it would need at least AVX2. – 2014-12-06 17:53:15
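For reference, a gather of four 32-bit values could look roughly like the sketch below. It assumes an AVX2-capable CPU and a GCC or Clang compiler (the `target` attribute is compiler-specific); the function name `gather4_avx2` and its parameters are hypothetical, not from the question.

```c
#include <immintrin.h>  /* AVX2 intrinsics */

/* Hypothetical helper: loads working_buffer[index[0..3]] with one gather.
 * The target attribute lets GCC/Clang emit AVX2 code for this function
 * only, so the caller must check at run time that the CPU supports AVX2. */
static __attribute__((target("avx2")))
void gather4_avx2(const unsigned *working_buffer, const int index[4],
                  unsigned color[4])
{
    __m128i vindex = _mm_loadu_si128((const __m128i *)index);
    /* scale = 4: the indices count 4-byte elements, not bytes */
    __m128i vcolor = _mm_i32gather_epi32((const int *)working_buffer, vindex, 4);
    _mm_storeu_si128((__m128i *)color, vcolor);
}
```

On a CPU without AVX2 this function must not be called; with GCC, `__builtin_cpu_supports("avx2")` is one way to guard the call.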

+0

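working_buffer is a two-dimensional texture of unsigned colors. The data is 1024 x 1024, though sadly the row dimension of working_buffer is larger than 1024. If it is really needed, I could rewrite some code so that it is simply unsigned texture_bitmap[1024][1024] – user2214913 2014-12-06 19:48:09

Answers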
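To make the access pattern explicit, the index math from the question can be pulled out into a scalar helper. This is only a sketch: `sample_index` and its `stride` parameter (standing in for working_buffer_size_x) are hypothetical names, and it assumes that `>>` on a negative int is an arithmetic shift, as it is with GCC/MinGW.

```c
#include <stddef.h>

/* Hypothetical helper: element index into working_buffer for step k.
 * idx/idy are the start position and iddx/iddy the per-pixel step, all in
 * fixed point with 6 fractional bits. The constant 512 offsets the
 * coordinates (presumably to the centre of the 1024 x 1024 texture). */
static size_t sample_index(int k, int idx, int idy, int iddx, int iddy,
                           int stride)
{
    int xc = 512 + ((idx + k * iddx) >> 6);  /* texture column */
    int yc = 512 + ((idy + k * iddy) >> 6);  /* texture row */
    return (size_t)yc * (size_t)stride + (size_t)xc;
}
```

Read this way, each output pixel takes one texel along a straight line through the texture. Consecutive values of k are generally not adjacent in memory, which is why the loads behave like a gather.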

1

My SSE is a bit rusty, but this is what you should do:

xmm0: [k, k+1, k+2, k+3] //xc0, xc1,.... 
xmm1: [k, k+1, k+2, k+3] //yc0, yc1,.... 
//initialize before the loop 
xmm2: [512, 512, 512, 512] 
xmm3: [idx, idx, idx, idx] 
xmm4: [iddx, iddx, iddx, iddx] 
xmm5: [idy, idy, idy, idy] 
xmm6: [iddy, iddy, iddy, iddy] 
xmm7: [working_buffer_size_x, working_buffer_size_x, working_buffer_size_x, working_buffer_size_x] 

Compute:

xmm0 * xmm4 
xmm0 + xmm3 
xmm0 >> 6 
xmm0 + xmm2 

xmm0: [xc0, xc1, xc2, xc3] 
/////////////////////////////// 

xmm1 * xmm6 
xmm1 + xmm5 
xmm1 >> 6 
xmm1 + xmm2 

xmm1: [yc0, yc1, yc2, yc3] 

xmm1 * xmm7 
xmm1 + xmm0 

Now xmm1 is:

xmm1: [yc0*working_buffer_size_x + xc0, yc1*working_buffer_size_x + xc1, yc2*working_buffer_size_x + xc2, yc3*working_buffer_size_x + xc3] 

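You read from and write to memory (the working_buffer and frame_bitmap arrays) in every loop iteration, and those operations are much slower than the computation itself, so the speedup will not be as large as you might expect.

Edit

You need the working_buffer and frame_bitmap arrays to be aligned, and you need SSE4.1 (for _mm_mullo_epi32):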

#include <emmintrin.h> 
#include <smmintrin.h> //SSE4.1 

int a[4] __attribute__((aligned(16))); 
__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7; 

xmm2 = _mm_set1_epi32(512); 
xmm3 = _mm_set1_epi32(idx); 
xmm4 = _mm_set1_epi32(iddx); 
xmm5 = _mm_set1_epi32(idy); 
xmm6 = _mm_set1_epi32(iddy); 
xmm7 = _mm_set1_epi32(working_buffer_size_x); 

for(int k = 0; k <= n - 4; k += 4){
    xmm0 = _mm_set_epi32(k + 3, k + 2, k + 1, k); 
    xmm1 = _mm_set_epi32(k + 3, k + 2, k + 1, k); 

    //xmm0 * xmm4 
    xmm0 = _mm_mullo_epi32(xmm0, xmm4); 

    //xmm0 + xmm3 
    xmm0 = _mm_add_epi32(xmm0, xmm3); 

    //xmm0 >> 6 
    xmm0 = _mm_srai_epi32(xmm0, 6); 

    //xmm0 + xmm2 
    xmm0 = _mm_add_epi32(xmm0, xmm2); 



    //xmm1 * xmm6 
    xmm1 = _mm_mullo_epi32(xmm1, xmm6); 

    //xmm1 + xmm5 
    xmm1 = _mm_add_epi32(xmm1, xmm5); 

    //xmm1 >> 6 
    xmm1 = _mm_srai_epi32(xmm1, 6); 

    //xmm1 + xmm2 
    xmm1 = _mm_add_epi32(xmm1, xmm2); 


    //xmm1 * xmm7 
    xmm1 = _mm_mullo_epi32(xmm1, xmm7); 
    //xmm1 + xmm0 
    xmm1 = _mm_add_epi32(xmm1, xmm0); 


    //a[0] = yc0*working_buffer_size_x + xc0 
    //a[1] = yc1*working_buffer_size_x + xc1 
    //a[2] = yc2*working_buffer_size_x + xc2 
    //a[3] = yc3*working_buffer_size_x + xc3 
    _mm_store_si128((__m128i *)&a[0], xmm1); 

    unsigned color0 = working_buffer[ a[0] ]; 
    unsigned color1 = working_buffer[ a[1] ]; 
    unsigned color2 = working_buffer[ a[2] ]; 
    unsigned color3 = working_buffer[ a[3] ]; 

    int adr = base_adr + k; 

    frame_bitmap[adr] = color0; 
    frame_bitmap[adr+1]= color1; 
    frame_bitmap[adr+2]= color2; 
    frame_bitmap[adr+3]= color3; 
} 
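The loop above depends on _mm_mullo_epi32, which exists only from SSE4.1 on. If the target only has SSE2, a commonly used stand-in can be built from _mm_mul_epu32 and shuffles. The function name `mullo_epi32_sse2` is hypothetical; on 32-bit MinGW this likely needs the -msse2 compiler flag.

```c
#include <emmintrin.h>  /* SSE2 */

/* SSE2-only stand-in for _mm_mullo_epi32: the low 32 bits of each of the
 * four 32-bit products. The low half of a product is the same for signed
 * and unsigned operands, so the unsigned multiply is sufficient. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* _mm_mul_epu32 multiplies lanes 0 and 2 into two 64-bit products */
    __m128i even = _mm_mul_epu32(a, b);
    /* shift lanes 1 and 3 down into positions 0 and 2, then multiply them */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* move the low 32 bits of both products into lanes 0 and 1 */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    /* interleave back into lane order: [p0, p1, p2, p3] */
    return _mm_unpacklo_epi32(even, odd);
}
```

It costs several instructions per multiply, so whether it pays off compared with plain scalar code would have to be measured.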

You can optimize it further by avoiding _mm_store_si128((__m128i *)&a[0], xmm1); and int adr = base_adr + k; by using assembly that works on memory directly.
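Short of dropping to assembly, one variation that stays in intrinsics is to keep the fixed-point positions in vector accumulators and advance them by four steps per iteration, so the per-iteration vector multiplies by iddx and iddy disappear, and to write the four colors with a single store. The sketch below is SSE2 only. The function name `sample_line_sse2` and its parameter list are hypothetical, the scalar tail loop is an addition the original loop does not have, and no speedup has been measured.

```c
#include <emmintrin.h>  /* SSE2 */

/* Sketch: the same sampling loop, with positions advanced by addition. */
static void sample_line_sse2(unsigned *frame_bitmap, int base_adr,
                             const unsigned *working_buffer,
                             int working_buffer_size_x,
                             int idx, int idy, int iddx, int iddy, int n)
{
    const __m128i offset = _mm_set1_epi32(512);
    const __m128i stepx = _mm_set1_epi32(4 * iddx);
    const __m128i stepy = _mm_set1_epi32(4 * iddy);
    /* lanes 0..3 hold the fixed-point positions for k, k+1, k+2, k+3 */
    __m128i vx = _mm_set_epi32(idx + 3 * iddx, idx + 2 * iddx, idx + iddx, idx);
    __m128i vy = _mm_set_epi32(idy + 3 * iddy, idy + 2 * iddy, idy + iddy, idy);
    int xc[4], yc[4];
    int k = 0;

    for (; k <= n - 4; k += 4) {
        /* xc = 512 + (vx >> 6), yc = 512 + (vy >> 6) for four pixels */
        _mm_storeu_si128((__m128i *)xc,
                         _mm_add_epi32(_mm_srai_epi32(vx, 6), offset));
        _mm_storeu_si128((__m128i *)yc,
                         _mm_add_epi32(_mm_srai_epi32(vy, 6), offset));

        /* the four loads stay scalar: SSE has no gather instruction */
        __m128i colors = _mm_set_epi32(
            (int)working_buffer[yc[3] * working_buffer_size_x + xc[3]],
            (int)working_buffer[yc[2] * working_buffer_size_x + xc[2]],
            (int)working_buffer[yc[1] * working_buffer_size_x + xc[1]],
            (int)working_buffer[yc[0] * working_buffer_size_x + xc[0]]);

        /* one unaligned 16-byte store writes all four output pixels */
        _mm_storeu_si128((__m128i *)&frame_bitmap[base_adr + k], colors);

        vx = _mm_add_epi32(vx, stepx);
        vy = _mm_add_epi32(vy, stepy);
    }

    /* scalar tail for the last n % 4 pixels, which the 4-wide loop skips */
    for (; k < n; ++k) {
        int x = 512 + ((idx + k * iddx) >> 6);
        int y = 512 + ((idy + k * iddy) >> 6);
        frame_bitmap[base_adr + k] = working_buffer[y * working_buffer_size_x + x];
    }
}
```

The unaligned load and store intrinsics are used so that neither array has to be 16-byte aligned. Since the texture loads dominate, any gain from this change may be small.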

+0

OK, I know the speedup here may not be big (if there is any), but I could test it if it worked at all – user2214913 2014-12-06 17:07:31

+0

PS this is part of the work, maybe someone could continue it? The mnemonics I know look like __m128 a128 = _mm_load_ps(a); __m128 b128 = _mm_load_ps(b); __m128 out128 = _mm_div_ps(a128,b128); _mm_store_ps(out,out128); – user2214913 2014-12-06 17:09:48

+1

@user2214913 As I said, I last wrote intrinsics a few months ago, so I don't remember them well. But your example is quite simple. I'll see what I can do. – 2014-12-06 17:23:58