What is the fastest way to convert an array of 8-bit integers to 16-bit integers


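I am working on a program that processes images. If I could store the RGBA values as 16-bit integers, I could use SSE to improve performance (without the risk of overflow). However, the conversion from 8-bit integers to 16-bit integers is the bottleneck. What is the fastest way to put signed 8-bit integers into a 16-bit integer array, that is, an efficient equivalent of the following loop?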

int8_t a[128]; 
int16_t b[128]; 

for (int i=0;i<128;i++) 
     b[i]=a[i];

I am using OpenMP and pointers.

Asked 2016-02-28 by Misery

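Have you looked at loop vectorization? This is almost a textbook example. –

@MichaelAlbers No. Shouldn't the compiler vectorize it automatically? – Misery

Maybe, maybe not. GCC will at -O3, but otherwise you have to tell it to. –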

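I did some measurements on my (rather noisy) desktop machine, which has an AMD CPU running at 3.1GHz. I don't know much about AMD's caching strategy, but it shouldn't matter much here.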

The code is here: gist of test.cpp. I compiled it with -O2 using GCC 4.9.2.

Results:

original: 0.0905usec 
aligned64: 0.1191usec 
unrolled_8s: 0.0625usec 
unrolled_64s: 0.0497usec

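original - the original code
aligned64 - I thought alignment might be the issue, so I forced the arrays to 64-bit alignment. That wasn't the problem.
unrolled_8s - the 128-iteration loop unrolled into groups of eight.
unrolled_64s - the 128-iteration loop unrolled into groups of 64.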
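The gist itself is not reproduced in this answer, so the following is only a sketch of what "unrolled into groups of eight" could look like. The function name `convert_unrolled8` and its signature are assumptions made for illustration, not the code that was measured.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reconstruction of the "unrolled_8s" variant: widen eight
// elements per loop iteration. Assumes n is a multiple of 8 (128 in the
// question); there is no tail handling.
void convert_unrolled8(const int8_t* src, int16_t* dest, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8) {
        dest[i + 0] = src[i + 0];
        dest[i + 1] = src[i + 1];
        dest[i + 2] = src[i + 2];
        dest[i + 3] = src[i + 3];
        dest[i + 4] = src[i + 4];
        dest[i + 5] = src[i + 5];
        dest[i + 6] = src[i + 6];
        dest[i + 7] = src[i + 7];
    }
}
```

Manual unrolling like this mostly reduces loop overhead (the compare and branch per element); whether it helps depends on what the compiler already does with the plain loop at the chosen optimization level.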

My CPU runs at 3.1GHz, so let's assume that is about 3 billion cycles per second, or about 3 cycles per nanosecond.
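The conversion from a measured time to cycles per element is a one-line calculation. A small sketch, using the rounded figure of 3 cycles per nanosecond assumed above (the names `kCyclesPerNs` and `cycles_per_copy` are made up for this illustration):

```cpp
// Assumed clock rate, rounded as in the text: about 3 cycles per nanosecond.
constexpr double kCyclesPerNs = 3.0;

// Convert the measured time for the whole loop into cycles per element.
constexpr double cycles_per_copy(double nanoseconds, int elements)
{
    return nanoseconds * kCyclesPerNs / elements;
}
```

For example, `cycles_per_copy(90.0, 128)` evaluates to about 2.11, which corresponds to the roughly 90 ns measured for the original loop over 128 elements.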

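original: 90 nsec ~ 270 cycles, so (270/128) = 2.11 cycles per copy
aligned64: 119 nsec ~ 357 cycles, so (357/128) = 2.79 cycles per copy
unrolled_8s: 62 nsec ~ 186 cycles, so (186/128) = 1.45 cycles per copy
unrolled_64s: 50 nsec ~ 150 cycles, so (150/128) = 1.17 cycles per copy

Please don't blindly assume that unrolling loops is always better! I'm cheating massively here by relying on two things: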

All the data stays in the cache

All the instructions (code) stay in the cache

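If the data keeps getting invalidated from the CPU's cache, you may pay a terrible price re-fetching it from main memory. In the worst case, the thread doing the copying may be thrown off the CPU between each copy (a "context switch"). On top of that, the data may be evicted from the cache. That means you would pay hundreds of microseconds for each context switch and hundreds of cycles for each memory access.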

Answered 2016-02-28 22:56:40 by scraimer

clang will vectorize this code at -O2.

#include <cstdlib> 
#include <cstdint> 
#include <cstdio> 

const int size = 128; 
uint8_t a[size]; 
int16_t b[size]; 


static __inline__ unsigned long long rdtsc(void) 
{ 
    unsigned hi, lo; 
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); 
    return ((unsigned long long)lo)|(((unsigned long long)hi)<<32); 
} 


void convert(uint8_t* src, int16_t* dest) 
{ 
    // Widen each 8-bit element to 16 bits.
    for (int i = 0; i < size; i++) 
        dest[i] = src[i]; 
} 

int main() 
{ 
    int sum1 = 0; 
    int sum2 = 0; 
    for(int i = 0; i < size; i++) 
    { 
     a[i] = rand(); 
     sum1 += a[i]; 
    } 
    auto t = rdtsc(); 
    convert(a, b); 
    t = rdtsc() - t; 
    for(int i = 0; i < size; i++) 
    { 
     sum2 += b[i]; 
    } 

    printf("%d = %d\n", sum1, sum2); 
    printf("t=%llu\n", t); 
}

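This is the code generated by clang++. Because the source array is uint8_t, the widening is a zero-extension: punpcklbw interleaves each source byte with a zero byte from %xmm0.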

; The loop inlined from `convert` as a single pass. 
    #APP 
    rdtsc 
    #NO_APP 
    movl %eax, %esi 
    movl %edx, %ecx 
    movq a(%rip), %xmm1 
    movq a+8(%rip), %xmm2 
    pxor %xmm0, %xmm0 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b(%rip) 
    movdqa %xmm2, b+16(%rip) 
    movq a+16(%rip), %xmm1 
    movq a+24(%rip), %xmm2 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b+32(%rip) 
    movdqa %xmm2, b+48(%rip) 
    movq a+32(%rip), %xmm1 
    movq a+40(%rip), %xmm2 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b+64(%rip) 
    movdqa %xmm2, b+80(%rip) 
    movq a+48(%rip), %xmm1 
    movq a+56(%rip), %xmm2 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b+96(%rip) 
    movdqa %xmm2, b+112(%rip) 
    movq a+64(%rip), %xmm1 
    movq a+72(%rip), %xmm2 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b+128(%rip) 
    movdqa %xmm2, b+144(%rip) 
    movq a+80(%rip), %xmm1 
    movq a+88(%rip), %xmm2 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b+160(%rip) 
    movdqa %xmm2, b+176(%rip) 
    movq a+96(%rip), %xmm1 
    movq a+104(%rip), %xmm2 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b+192(%rip) 
    movdqa %xmm2, b+208(%rip) 
    movq a+112(%rip), %xmm1 
    movq a+120(%rip), %xmm2 
    punpcklbw %xmm0, %xmm1 
    punpcklbw %xmm0, %xmm2 
    movdqa %xmm1, b+224(%rip) 
    movdqa %xmm2, b+240(%rip) 
    #APP 
    rdtsc 
    #NO_APP

For larger sizes it takes a little more, since the compiler won't inline and unroll to an unlimited size.

gcc only vectorizes at -O3 unless given further options, but it then generates similar code.

However, if you use -ftree-vectorize, gcc will also generate SSE instructions at -O2.
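If you would rather not depend on the auto-vectorizer, the same widening can be written by hand with SSE2 intrinsics. This is a minimal sketch, not code from the answer: the function names `widen_u8` and `widen_s8` are assumptions, it uses unaligned loads and stores so it makes no alignment assumptions, and it falls back to a plain scalar loop when `__SSE2__` is not defined. The signed version covers the int8_t case from the question.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Zero-extend unsigned 8-bit values to 16 bits (the uint8_t case).
void widen_u8(const uint8_t* src, int16_t* dest, std::size_t n)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // Interleave with zero bytes: each byte becomes one 16-bit lane.
        __m128i lo = _mm_unpacklo_epi8(v, zero);
        __m128i hi = _mm_unpackhi_epi8(v, zero);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i + 8), hi);
    }
#endif
    for (; i < n; ++i) dest[i] = src[i];  // scalar tail (or full fallback)
}

// Sign-extend signed 8-bit values to 16 bits (the int8_t case).
void widen_s8(const int8_t* src, int16_t* dest, std::size_t n)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // 0xFF in every byte lane where v is negative, 0x00 otherwise.
        __m128i sign = _mm_cmpgt_epi8(zero, v);
        // Interleave each byte with its sign byte to form the high half.
        __m128i lo = _mm_unpacklo_epi8(v, sign);
        __m128i hi = _mm_unpackhi_epi8(v, sign);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i + 8), hi);
    }
#endif
    for (; i < n; ++i) dest[i] = src[i];  // scalar tail (or full fallback)
}
```

On CPUs with SSE4.1, `_mm_cvtepi8_epi16` performs the sign extension of eight bytes in a single instruction. Whether hand-written intrinsics actually beat the compiler's own vectorization should be measured for the target machine and compiler.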

Answered 2016-02-28 22:03:30
