2014-10-20 37 views
0

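Grayscale bilinear patch extraction: SSE optimization

My program makes heavy use of small sub-images extracted from larger grayscale images using bilinear interpolation.

I use the following function for this purpose: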

bool extract_patch_bilin(const cv::Point2f &patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch) 
{ 
    const int hsize = patch.rows/2; 

    // ... 
    // Precondition checks: patch is a preallocated square matrix and both patch and image have continuous buffers 
    // ... 

    int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize; 
    if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows) 
     return false; 

    float x=patch_ctr.x-hsize-floorx; 
    float y=patch_ctr.y-hsize-floory; 
    float xy = x*y; 
    float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy; 
    int img_stride = img.cols-patch.cols; 
    uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx; 
    uchar* buff_img1 = buff_img0+img.cols; 
    uchar* buff_patch = (uchar*)patch.data; 
    for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) { 
     for(int u=0; u<patch.cols; ++u,++buff_patch,++buff_img0,++buff_img1) 
      buff_patch[0] = cv::saturate_cast<uchar>(buff_img0[0]*w00+buff_img0[1]*w01+buff_img1[0]*w10+buff_img1[1]*w11); 
    } 
    return true; 
} 

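Long story short, I already use parallelism in other parts of the program, and I am considering optimizing this function with SSE, because I mostly use 8x8 patches and processing a run of 8 pixels at a time with SSE seems like a good idea.

However, I am not sure how to handle the multiplication by the float interpolation weights (i.e. w00, w01, w10 and w11). These weights are necessarily non-negative and no greater than 1, so the multiplication cannot overflow the unsigned char data type.

Does anyone know how to proceed?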


Edit:

I tried to do this as follows (assuming 16x16 patches), but got no significant speedup:

bool extract_patch_bilin_16x16(const cv::Point2f& patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch) 
{ 
    // ... 
    // Precondition checks 
    // ... 

    const int hsize = patch.rows/2; 
    int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize; 
    // Check that the full extracted patch is inside the image 
    if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows) 
     return false; 

    // Compute the constant bilinear weights 
    float x=patch_ctr.x-hsize-floorx; 
    float y=patch_ctr.y-hsize-floory; 
    float xy = x*y; 
    float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy; 
    // Prepare image resampling loop 
    int img_stride = img.cols-patch.cols; 
    uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx; 
    uchar* buff_img1 = buff_img0+img.cols; 
    uchar* buff_patch = (uchar*)patch.data; 
    // Precompute weighting variables 
    const __m128i CONST_0 = _mm_setzero_si128(); 
    __m128i w00x256_32i = _mm_set1_epi32(cvRound(w00*256)); 
    __m128i w01x256_32i = _mm_set1_epi32(cvRound(w01*256)); 
    __m128i w10x256_32i = _mm_set1_epi32(cvRound(w10*256)); 
    __m128i w11x256_32i = _mm_set1_epi32(cvRound(w11*256)); 
    __m128i w00x256_16i = _mm_packs_epi32(w00x256_32i,w00x256_32i); 
    __m128i w01x256_16i = _mm_packs_epi32(w01x256_32i,w01x256_32i); 
    __m128i w10x256_16i = _mm_packs_epi32(w10x256_32i,w10x256_32i); 
    __m128i w11x256_16i = _mm_packs_epi32(w11x256_32i,w11x256_32i); 
    // Process pixels 
    int ngroups = patch.rows>>4; 
    for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) { 
     for(int g=0; g<ngroups; ++g,buff_patch+=16,buff_img0+=16,buff_img1+=16) { 
       //////////////////////////////// 
       // Load the data (16 pixels in one load) 
       //////////////////////////////// 
       __m128i val00 = _mm_loadu_si128((__m128i*)buff_img0); 
       __m128i val01 = _mm_loadu_si128((__m128i*)(buff_img0+1)); 
       __m128i val10 = _mm_loadu_si128((__m128i*)buff_img1); 
       __m128i val11 = _mm_loadu_si128((__m128i*)(buff_img1+1)); 
       //////////////////////////////// 
       // Process the lower 8 values 
       //////////////////////////////// 
       // Unpack into 16-bits integers 
       __m128i val00_lo = _mm_unpacklo_epi8(val00,CONST_0); 
       __m128i val01_lo = _mm_unpacklo_epi8(val01,CONST_0); 
       __m128i val10_lo = _mm_unpacklo_epi8(val10,CONST_0); 
       __m128i val11_lo = _mm_unpacklo_epi8(val11,CONST_0); 
       // Multiply with the integer weights 
       __m128i w256val00_lo = _mm_mullo_epi16(val00_lo,w00x256_16i); 
       __m128i w256val01_lo = _mm_mullo_epi16(val01_lo,w01x256_16i); 
       __m128i w256val10_lo = _mm_mullo_epi16(val10_lo,w10x256_16i); 
       __m128i w256val11_lo = _mm_mullo_epi16(val11_lo,w11x256_16i); 
       // Divide by 256 to get the approximate result of the multiplication with floating-point weights 
       __m128i wval00_lo = _mm_srli_epi16(w256val00_lo,8); 
       __m128i wval01_lo = _mm_srli_epi16(w256val01_lo,8); 
       __m128i wval10_lo = _mm_srli_epi16(w256val10_lo,8); 
       __m128i wval11_lo = _mm_srli_epi16(w256val11_lo,8); 
       // Add pairwise 
       __m128i sum0_lo = _mm_add_epi16(wval00_lo,wval01_lo); 
       __m128i sum1_lo = _mm_add_epi16(wval10_lo,wval11_lo); 
       __m128i final_lo = _mm_add_epi16(sum0_lo,sum1_lo); 
       //////////////////////////////// 
       // Process the higher 8 values 
       //////////////////////////////// 
       // Unpack into 16-bits integers 
       __m128i val00_hi = _mm_unpackhi_epi8(val00,CONST_0); 
       __m128i val01_hi = _mm_unpackhi_epi8(val01,CONST_0); 
       __m128i val10_hi = _mm_unpackhi_epi8(val10,CONST_0); 
       __m128i val11_hi = _mm_unpackhi_epi8(val11,CONST_0); 
       // Multiply with the integer weights 
       __m128i w256val00_hi = _mm_mullo_epi16(val00_hi,w00x256_16i); 
       __m128i w256val01_hi = _mm_mullo_epi16(val01_hi,w01x256_16i); 
       __m128i w256val10_hi = _mm_mullo_epi16(val10_hi,w10x256_16i); 
       __m128i w256val11_hi = _mm_mullo_epi16(val11_hi,w11x256_16i); 
       // Divide by 256 to get the approximate result of the multiplication with floating-point weights 
       __m128i wval00_hi = _mm_srli_epi16(w256val00_hi,8); 
       __m128i wval01_hi = _mm_srli_epi16(w256val01_hi,8); 
       __m128i wval10_hi = _mm_srli_epi16(w256val10_hi,8); 
       __m128i wval11_hi = _mm_srli_epi16(w256val11_hi,8); 
       // Add pairwise 
       __m128i sum0_hi = _mm_add_epi16(wval00_hi,wval01_hi); 
       __m128i sum1_hi = _mm_add_epi16(wval10_hi,wval11_hi); 
       __m128i final_hi = _mm_add_epi16(sum0_hi,sum1_hi); 
       //////////////////////////////// 
       // Repack all values 
       //////////////////////////////// 
       __m128i final_val = _mm_packus_epi16(final_lo,final_hi); 
       _mm_storeu_si128((__m128i*)buff_patch,final_val); 
     } 
    }
    return true;
}

Any ideas on what could be done to improve the speedup?

Answers

2

I would consider sticking to integers: your weights are multiples of 1/64, so 8.6 fixed point is sufficient, and it fits in a 16-bit number.
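A minimal scalar sketch of the integer approach, for illustration only. It assumes 8 fractional bits (weights scaled by 256) rather than the 6 bits mentioned above, and the names `FixedWeights`, `quantize_weights` and `blend_fixed` are hypothetical, not part of OpenCV or of the question's code. The weights are constructed so that they are non-negative and sum to exactly 256, which keeps the rounded weighted sum below 65536.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper type: bilinear weights scaled by 256.
struct FixedWeights { int w00, w01, w10, w11; };

// Quantize the fractional offsets x, y in [0,1) so that the four
// weights are non-negative and sum to exactly 256.
inline FixedWeights quantize_weights(float x, float y)
{
    assert(x >= 0.f && x < 1.f && y >= 0.f && y < 1.f);
    const int X = static_cast<int>(std::lround(x * 256.f)); // 0..256
    const int Y = static_cast<int>(std::lround(y * 256.f)); // 0..256
    const int w11 = (X * Y + 128) >> 8;                     // round(X*Y/256)
    return { 256 - X - Y + w11, X - w11, Y - w11, w11 };
}

// Blend the four neighbouring pixels; adding 128 rounds to nearest
// before the shift that divides by 256.
inline std::uint8_t blend_fixed(const FixedWeights& w,
                                std::uint8_t p00, std::uint8_t p01,
                                std::uint8_t p10, std::uint8_t p11)
{
    const int sum = p00 * w.w00 + p01 * w.w01 + p10 * w.w10 + p11 * w.w11;
    return static_cast<std::uint8_t>((sum + 128) >> 8);
}
```

Since the weights are constant over the whole patch, `quantize_weights` would be called once per patch and `blend_fixed` once per output pixel.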

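Bilinear interpolation is best done as three linear interpolations (two along Y, then one along X; you can reuse the second Y interpolation for the neighboring patch).

To perform a linear interpolation between two values, you pre-store all the interpolation weights P and Q once (8 down to 1 and 0 up to 7), then multiply and add them pairwise, as in V0.P[i] + V1.Q[i]. This is done efficiently with the PMADDUBSW instruction (after appropriate interleaving of the data, and replication of the values V0 and V1 with PUNPCKLBW and the like).

Finally, divide by the total weight (PSRLW) and rescale to bytes (PACKUSWB). (This step can be performed just once, combining the two interpolations.)

You might think of doubling all the weights so that the final scaling is by 8 bits and PACKUSWB alone is enough, but unfortunately it saturates the values and there is no non-saturating equivalent.

It may be better to precompute all 64 interpolation weights and sum the four bilinear terms.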
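As a sketch of that decomposition, the following scalar code compares the three-lerp form with the four-weight form used in the question. The function names are hypothetical and floating point is used only to keep the comparison readable.

```cpp
#include <cmath>

// Linear interpolation between a and b with fraction t in [0,1].
inline float lerp1(float a, float b, float t) { return a + t * (b - a); }

// Bilinear interpolation as two lerps along Y followed by one along X.
// p00/p01 are the top-left/top-right pixels, p10/p11 the bottom ones.
inline float bilinear_separable(float p00, float p01, float p10, float p11,
                                float x, float y)
{
    const float left  = lerp1(p00, p10, y); // left column, along Y
    const float right = lerp1(p01, p11, y); // right column, along Y
    return lerp1(left, right, x);           // between the columns, along X
}

// The four-weight form from the question, for comparison.
inline float bilinear_weights(float p00, float p01, float p10, float p11,
                              float x, float y)
{
    const float xy = x * y;
    return p00 * (1 - x - y + xy) + p01 * (x - xy) + p10 * (y - xy) + p11 * xy;
}
```

When walking along a row, the `right` value of one output pixel is the `left` value of the next, so only one new Y interpolation is needed per pixel.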

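UPDATE:

If the goal is to interpolate with fixed coefficients for all quads of pixels (in effect, a sub-pixel translation), the strategy is different.

You would load a run of 8 (16?) pixels corresponding to the top-left corners, a run of 8 shifted one pixel to the right (corresponding to the top-right corners), and similarly for the next row (the bottom corners); multiply and add pairs of pixel values (PMADDUBSW) with the corresponding interpolation weights, and merge the pairs (PADDW). Store the weights with replication.

Another option is to avoid PMADD and perform separate multiplies (PMULLW) and adds (PADDW). This simplifies the reorganization scheme.

After scaling (as above), you end up with a run of 8 interpolated values.

This also works for variable interpolation weights, as long as there is one interpolated pixel per input pixel.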
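The PMULLW/PADDW variant might look like the SSE2 sketch below, which produces 8 output pixels of one row. It is an illustration under stated assumptions, not a tuned implementation: the function name `bilinear_row8_sse2` is hypothetical, the weights are assumed to be pre-scaled by 256, non-negative and summing to exactly 256, and 9 bytes must be readable from each source row.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Interpolate 8 horizontally adjacent output pixels.
// row0/row1 point at the top-left/bottom-left source pixel of the first
// output pixel. Assumption: w00..w11 are scaled by 256 and sum to 256.
inline void bilinear_row8_sse2(const std::uint8_t* row0, const std::uint8_t* row1,
                               int w00, int w01, int w10, int w11,
                               std::uint8_t* dst)
{
    assert(w00 >= 0 && w01 >= 0 && w10 >= 0 && w11 >= 0);
    assert(w00 + w01 + w10 + w11 == 256);
    const __m128i zero = _mm_setzero_si128();
    // Load 8 bytes per corner and widen them to 16-bit lanes.
    const __m128i p00 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row0)), zero);
    const __m128i p01 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row0 + 1)), zero);
    const __m128i p10 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row1)), zero);
    const __m128i p11 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row1 + 1)), zero);
    // PMULLW keeps the low 16 bits of each product. Because the weights
    // sum to 256, the total is at most 255*256 and fits in unsigned 16 bits.
    __m128i sum = _mm_mullo_epi16(p00, _mm_set1_epi16(static_cast<short>(w00)));
    sum = _mm_add_epi16(sum, _mm_mullo_epi16(p01, _mm_set1_epi16(static_cast<short>(w01))));
    sum = _mm_add_epi16(sum, _mm_mullo_epi16(p10, _mm_set1_epi16(static_cast<short>(w10))));
    sum = _mm_add_epi16(sum, _mm_mullo_epi16(p11, _mm_set1_epi16(static_cast<short>(w11))));
    // Round to nearest, then divide by 256 with a logical shift (PSRLW).
    sum = _mm_srli_epi16(_mm_add_epi16(sum, _mm_set1_epi16(128)), 8);
    // Pack back to bytes (PACKUSWB) and store the low 8 of them.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), _mm_packus_epi16(sum, sum));
}
```

Summing the four products before a single shift applies only one rounding step per pixel, whereas the edited code in the question shifts each product separately and truncates four times.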

+0

Ooops, this principle involves a x8 magnification factor, which may not be what you want... – 2014-10-20 19:02:41

+0

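Well, I don't see why the weights have to be multiples of 1./64 :) Also, I only have one image, from which I want to extract an 8x8 patch. I think it could be done by pre-multiplying the weights by 256, storing the products of weight x 8-bit value in 16-bit integers, and dividing by 256 again at the end. I am just not sure how to do this accurately and efficiently... – AldurDisciple 2014-10-20 19:11:39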

+0

What is your scaling factor? – 2014-10-20 19:16:38