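Fast vectorized conversion from RGB to BGRA

As a follow-up to some earlier questions on converting RGB to RGBA, and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine, and that I would like to use intrinsics. I'm having difficulty aligning both the source and destination buffers to work with 128-bit registers, so I'm looking for other savvy vectorization solutions.

The routine to be vectorized is as follows...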

void RGB8ToBGRX8(int w, const void *in, void *out)
{
    int i;
    int width = w;
    const unsigned char *src = (const unsigned char *) in;
    unsigned int *dst = (unsigned int *) out;
    unsigned int invalue, outvalue;

    for (i = 0; i < width; i++, src += 3, dst++)
    {
        invalue = src[0];
        outvalue = (invalue << 16);
        invalue = src[1];
        outvalue |= (invalue << 8);
        invalue = src[2];
        outvalue |= (invalue);
        *dst = outvalue | 0xff000000;
    }
}

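This routine is used primarily for large textures (512KB), so if I can parallelize some of the operations, it may be beneficial to process more pixels at a time. Of course, I'll need to profile. :)

Edit:

My compile arguments...

gcc -O2 main.c

Source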

2011-08-25 Rev316

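Are you using your compiler's optimization flags (and which ones)? The compiler will usually do a better job of optimizing the code, without introducing bugs. What benchmark data have you collected? –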

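Not an SSE answer, but have you tried unrolling your loop 4 times so that the input always starts at an aligned address? Then you can read the input one machine word at a time rather than bytewise, with specialized shifting and masking for each relative position of the source pixel. As Dana mentions, it's worth seeing how well the compiler performs at high optimization levels (inspect the generated assembly in addition to benchmarking), but I doubt it is aggressive enough to unroll the loop _and_ handle the alignment of "in" all by itself. –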
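The word-at-a-time idea from the comment above might look roughly like the following. This is only a sketch under stated assumptions: the function name `rgb_to_bgrx_words` is made up for illustration, the machine is assumed to be little-endian (as x86 is), and `memcpy` is used for the loads so that the source does not actually have to be aligned (compilers typically turn these into plain 32-bit loads). Four pixels occupy exactly three 32-bit words, which is what makes the 4x unroll convenient.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name; assumes a little-endian machine such as x86.
 * Output layout matches the original routine: 0xFF, R, G, B from the
 * most significant byte down, i.e. B G R X in memory. */
void rgb_to_bgrx_words(int w, const void *in, void *out)
{
    const unsigned char *src = (const unsigned char *)in;
    uint32_t *dst = (uint32_t *)out;
    int i = 0;

    /* 4 pixels = 12 source bytes = three 32-bit words. */
    for (; i + 4 <= w; i += 4, src += 12, dst += 4) {
        uint32_t w0, w1, w2;
        memcpy(&w0, src, 4);      /* bytes: R0 G0 B0 R1 */
        memcpy(&w1, src + 4, 4);  /* bytes: G1 B1 R2 G2 */
        memcpy(&w2, src + 8, 4);  /* bytes: B2 R3 G3 B3 */

        /* One specialized shift/mask pattern per pixel position. */
        dst[0] = 0xff000000u | ((w0 & 0xffu) << 16) | (w0 & 0xff00u) | ((w0 >> 16) & 0xffu);
        dst[1] = 0xff000000u | ((w0 >> 24) << 16) | ((w1 & 0xffu) << 8) | ((w1 >> 8) & 0xffu);
        dst[2] = 0xff000000u | (w1 & 0xff0000u) | ((w1 >> 24) << 8) | (w2 & 0xffu);
        dst[3] = 0xff000000u | ((w2 & 0xff00u) << 8) | ((w2 >> 8) & 0xff00u) | (w2 >> 24);
    }

    /* Remaining 0..3 pixels, byte by byte as in the original routine. */
    for (; i < w; i++, src += 3, dst++)
        *dst = 0xff000000u | ((uint32_t)src[0] << 16) | ((uint32_t)src[1] << 8) | src[2];
}
```

Whether this beats the plain byte loop depends on the compiler and the CPU, so it would need to be measured like any other variant.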

Great questions. It's just "O2" (not O3) with GCC 4.6. My benchmark case is a run of 10K iterations with 512 as the "width" span. Thanks for the great comments! – Rev316
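For reference, a benchmark of the kind described in the comment above (many iterations over a fixed width) could be sketched as follows. The names `convert_fn`, `rgb8_to_bgrx8_ref` and `time_conversion` are hypothetical, and `clock()` measures CPU time at fairly coarse resolution, so treat the result as a rough comparison tool rather than a precise measurement.

```c
#include <stdlib.h>
#include <time.h>

/* Any converter with the same signature as the routine in the question. */
typedef void (*convert_fn)(int w, const void *in, void *out);

/* Scalar reference using the same arithmetic as the question's routine. */
static void rgb8_to_bgrx8_ref(int w, const void *in, void *out)
{
    const unsigned char *src = (const unsigned char *)in;
    unsigned int *dst = (unsigned int *)out;
    int i;
    for (i = 0; i < w; i++, src += 3, dst++)
        *dst = 0xff000000u | ((unsigned)src[0] << 16) | ((unsigned)src[1] << 8) | src[2];
}

/* Returns the average CPU seconds per call, or -1.0 on bad arguments
 * or allocation failure. */
double time_conversion(convert_fn fn, int width, int iterations)
{
    unsigned char *src;
    unsigned int *dst;
    volatile unsigned int sink = 0;  /* keeps the results observably used */
    clock_t t0, t1;
    int i;

    if (width <= 0 || iterations <= 0)
        return -1.0;
    src = malloc((size_t)width * 3);
    dst = malloc((size_t)width * sizeof *dst);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    for (i = 0; i < width * 3; i++)
        src[i] = (unsigned char)(i * 31 + 7);  /* arbitrary test pattern */

    t0 = clock();
    for (i = 0; i < iterations; i++) {
        fn(width, src, dst);
        sink += dst[i % width];
    }
    t1 = clock();

    free(src);
    free(dst);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / iterations;
}
```

Calling `time_conversion(rgb8_to_bgrx8_ref, 512, 10000)` would reproduce the 10K-iteration, width-512 setup; passing a different converter in place of the reference gives a like-for-like comparison.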

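This is an example of using SSSE3 intrinsics to perform the requested operation (the byte shuffle `_mm_shuffle_epi8` is an SSSE3 instruction, not SSE3). The input and output pointers must be 16-byte aligned, and it operates on a block of 16 pixels at a time.

I don't think you will get a significant speed boost, though. The operations performed on the pixels are so simple that memory bandwidth dominates.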

#include <tmmintrin.h>  /* SSSE3: needed for _mm_shuffle_epi8 */

/* in and out must be 16-byte aligned. Only whole blocks of 16 pixels are
 * converted; a trailing w % 16 pixels are left untouched. */
void rgb_to_bgrx_sse(unsigned w, const void *in, void *out)
{
    const __m128i *in_vec = in;
    __m128i *out_vec = out;

    w /= 16;

    while (w-- > 0) {
        /*             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
         * in_vec[0]  Ra Ga Ba Rb Gb Bb Rc Gc Bc Rd Gd Bd Re Ge Be Rf
         * in_vec[1]  Gf Bf Rg Gg Bg Rh Gh Bh Ri Gi Bi Rj Gj Bj Rk Gk
         * in_vec[2]  Bk Rl Gl Bl Rm Gm Bm Rn Gn Bn Ro Go Bo Rp Gp Bp
         */
        __m128i in1, in2, in3;
        __m128i out;

        in1 = in_vec[0];

        /* Pixels a-d: reverse each RGB triple, zero the 4th byte, then set it to 0xff. */
        out = _mm_shuffle_epi8(in1,
            _mm_set_epi8(0xff, 9, 10, 11, 0xff, 6, 7, 8, 0xff, 3, 4, 5, 0xff, 0, 1, 2));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[0] = out;

        in2 = in_vec[1];

        /* Pixels e-h: high 8 bytes of in1 merged with low 8 bytes of in2. */
        in1 = _mm_and_si128(in1,
            _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0, 0, 0, 0, 0, 0, 0, 0));
        out = _mm_and_si128(in2,
            _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff));
        out = _mm_or_si128(out, in1);
        out = _mm_shuffle_epi8(out,
            _mm_set_epi8(0xff, 5, 6, 7, 0xff, 2, 3, 4, 0xff, 15, 0, 1, 0xff, 12, 13, 14));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[1] = out;

        in3 = in_vec[2];
        in_vec += 3;

        /* Pixels i-l: high 8 bytes of in2 merged with low 8 bytes of in3. */
        in2 = _mm_and_si128(in2,
            _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0, 0, 0, 0, 0, 0, 0, 0));
        out = _mm_and_si128(in3,
            _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff));
        out = _mm_or_si128(out, in2);
        out = _mm_shuffle_epi8(out,
            _mm_set_epi8(0xff, 1, 2, 3, 0xff, 14, 15, 0, 0xff, 11, 12, 13, 0xff, 8, 9, 10));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[2] = out;

        /* Pixels m-p: bytes 4..15 of in3. */
        out = _mm_shuffle_epi8(in3,
            _mm_set_epi8(0xff, 13, 14, 15, 0xff, 10, 11, 12, 0xff, 7, 8, 9, 0xff, 4, 5, 6));
        out = _mm_or_si128(out,
            _mm_set_epi8(0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0, 0xff, 0, 0, 0));
        out_vec[3] = out;

        out_vec += 4;
    }
}
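Since the question mentions trouble aligning both buffers, one possible way around the alignment requirement is to use unaligned loads and stores and convert four pixels per shuffle. The following is a sketch, not the answer author's code: the name `rgb_to_bgrx_ssse3_unaligned` and the `SSSE3_FN` macro are made up here, and the `target("ssse3")` attribute is a GCC/Clang convenience so the function builds without passing `-mssse3` globally. Each 16-byte load only uses its first 12 bytes, so the vector loop stops early enough that the load never reads past the end of the source, and a scalar loop finishes the rest.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Hypothetical name. Converts w packed RGB pixels to BGRX (X = 0xff)
 * with no alignment requirement on either buffer. */
SSSE3_FN
void rgb_to_bgrx_ssse3_unaligned(size_t w, const uint8_t *in, uint8_t *out)
{
    /* Same per-pixel byte reversal as the first shuffle in the aligned version. */
    const __m128i shuf = _mm_set_epi8(-1, 9, 10, 11, -1, 6, 7, 8,
                                      -1, 3, 4, 5, -1, 0, 1, 2);
    const __m128i alpha = _mm_set1_epi32((int)0xff000000u);
    size_t i = 0;

    /* A 16-byte load at pixel i touches source bytes 3i .. 3i+15, so only
     * run the vector loop while that range lies inside the 3*w-byte source. */
    for (; i + 6 <= w; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + 3 * i));
        v = _mm_or_si128(_mm_shuffle_epi8(v, shuf), alpha);
        _mm_storeu_si128((__m128i *)(out + 4 * i), v);
    }

    /* Scalar tail for the last few pixels. */
    for (; i < w; i++) {
        out[4 * i + 0] = in[3 * i + 2];  /* B */
        out[4 * i + 1] = in[3 * i + 1];  /* G */
        out[4 * i + 2] = in[3 * i + 0];  /* R */
        out[4 * i + 3] = 0xff;           /* X */
    }
}
```

This trades the mask-and-merge steps for overlapping loads, which keeps the code short. Given that the conversion is likely memory-bound, it should not be expected to run much faster than the aligned version without measuring.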

Source

2011-08-26 23:35:20 caf

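I don't have a full understanding of what you are asking, and I'm eagerly awaiting a proper response to your question. In the meantime, I've come up with an implementation that is roughly 8 to 10% faster on average. I'm running Win7 64-bit with VS2010, compiling as C++ with the "fast" optimization option.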

#pragma pack(push, 1)
struct RGB {
    unsigned char r, g, b;
};

struct BGRA {
    unsigned char b, g, r, a;
};
#pragma pack(pop)

void RGB8ToBGRX8(int width, const void* in, void* out)
{
    const RGB* src = (const RGB*)in;
    BGRA* dst = (BGRA*)out;
    if (width <= 0)
        return;  // the do/while below assumes at least one pixel
    do {
        dst->r = src->r;
        dst->g = src->g;
        dst->b = src->b;
        dst->a = 0xFF;
        src++;
        dst++;
    } while (--width);
}

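This may or may not help, but I hope it does. Please don't vote me down if it doesn't; I'm just trying to move this along.

My motivation for using structs is to let the compiler advance the pointers src and dst as efficiently as possible. Another motivation is to limit the number of arithmetic operations.

Source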

2011-08-25 18:28:13 Jack

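No worries, Jack! If you could clarify which part you might not understand, I can try to elaborate. :) – Rev316

What do you mean by using SSE? I assume it means directing the compiler to use specific optimization techniques, and if that is the case, maybe it is not worth tweaking the code by hand. You also say you would like to use intrinsics; what do you mean by that? I do have a good grasp of parallelization, however. – Jack

Oh. I was referring to the vectorization features of SSE2/3 or SSSE3. Mostly padding/masking operations, as I've seen elegant solutions for other image conversions. Now, I know GCC 4.x has several compile flags that help here, but I'm not sure which ones, and/or whether that approach is better. Maybe your expertise would be helpful here. – Rev316

Personally, I found that the following gave me the best results for converting BGR-24 to ARGB-32.

This code runs in about 8.8ms per image, whereas the 128-bit vectorized code presented above ran at 14.5ms per image.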

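#include <sys/types.h>  /* u_int32_t */

void PixelFix(u_int32_t *buff, unsigned char *diskmem)
{
    int i, j;
    int picptr = 0, srcptr = 0;  /* output pixel index, input byte index */
    int w = 1920;
    int h = 1080;

    for (j = 0; j < h; j++) {
        for (i = 0; i < w; i++) {
            /* Cast before shifting so byte values >= 0x80 don't overflow int. */
            buff[picptr++] = ((u_int32_t)diskmem[srcptr] << 24)
                           | ((u_int32_t)diskmem[srcptr+1] << 16)
                           | ((u_int32_t)diskmem[srcptr+2] << 8)
                           | 0xff;
            srcptr += 3;
        }
    }
}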

Previously, I had been using this routine (about 13.2ms per image). Here, buff is an unsigned char*.

for (j = 0; j < h; j++) {
    int srcptr = (h-j-1)*w*3; // remove if you don't want vertical flipping
    for (i = 0; i < w; i++) {
        buff[picptr+3] = diskmem[srcptr++]; // b
        buff[picptr+2] = diskmem[srcptr++]; // g
        buff[picptr+1] = diskmem[srcptr++]; // r
        buff[picptr+0] = 255;               // a
        picptr += 4;
    }
}

Running on a 2012 Mac Mini, 2.6 GHz i7.

Source

2013-08-26 03:47:25 zzyzy

Also, one might want to look at Apple's recent vImage conversion APIs..., specifically routines like "vImageConvert_RGB888toARGB8888" for converting from 24-bit RGB to 32-bit ARGB (or BGRA). https://developer.apple.com/library/mac/documentation/Performance/Reference/vImage_conversion/Reference/reference.html#//apple_ref/c/func/vImageConvert_RGB888toARGB8888 – zzyzy

Hmm... using vImageConvert_RGB888toARGB8888 is very, very fast (a 15x speedup).

The PixelFix code above (≈6ms per image, now on newer hardware):

6.373520 ms
6.383363 ms
6.413560 ms
6.278606 ms
6.293607 ms
6.368118 ms
6.338904 ms
6.389385 ms
6.365495 ms

Using vImageConvert_RGB888toARGB8888, threaded (on the newer hardware):

0.563649 ms
0.400387 ms
0.375198 ms
0.360898 ms
0.391278 ms
0.396797 ms
0.405534 ms
0.386495 ms
0.367621 ms

Need I say more?
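The answer does not show the call itself, so here is a sketch of how `vImageConvert_RGB888toARGB8888` might be invoked. The wrapper name `rgb888_to_argb8888` is hypothetical, rows are assumed to be tightly packed (no padding), and the non-Apple branch is only a plain-C fallback added so the sketch builds elsewhere. On macOS this needs to be linked with `-framework Accelerate`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical wrapper: converts width*height packed RGB pixels to ARGB
 * (A, R, G, B byte order in memory) with alpha fixed at 0xFF.
 * Returns 0 on success, -1 on failure. */
int rgb888_to_argb8888(const uint8_t *rgb, uint8_t *argb, size_t width, size_t height)
{
#if defined(__APPLE__)
    /* vImage_Buffer fields: data, height, width, rowBytes. */
    vImage_Buffer src = { (void *)rgb, height, width, width * 3 };
    vImage_Buffer dst = { argb, height, width, width * 4 };

    /* NULL alpha plane: use the constant 0xFF for every pixel; no premultiply. */
    vImage_Error err = vImageConvert_RGB888toARGB8888(&src, NULL, 0xFF, &dst,
                                                      false, kvImageNoFlags);
    return err == kvImageNoError ? 0 : -1;
#else
    /* Portable fallback (assumption: not part of vImage). */
    size_t i;
    for (i = 0; i < width * height; i++) {
        argb[4 * i + 0] = 0xFF;            /* A */
        argb[4 * i + 1] = rgb[3 * i + 0];  /* R */
        argb[4 * i + 2] = rgb[3 * i + 1];  /* G */
        argb[4 * i + 3] = rgb[3 * i + 2];  /* B */
    }
    return 0;
#endif
}
```

For a 1920x1080 frame as in the timings above, the call would be `rgb888_to_argb8888(src, dst, 1920, 1080)` with `src` holding 1920*1080*3 bytes and `dst` 1920*1080*4 bytes.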

Source

2014-06-10 17:34:15 zzyzy

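A follow-up... running the 128-bit vector code "rgb_to_bgrx_sse" above on a single thread gives results in the 11ms range for I/O buffers of the same size. vImage is the clear winner here. – zzyzy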
