NEON equivalent of SSE intrinsics

I am trying to convert C code into an optimized function that uses NEON intrinsics.

Here is the C code, which operates on two scalar operands rather than on vectors of operands.

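uint16_t mult_z216(uint16_t a, uint16_t b) {
    /* Cast first: uint16_t operands promote to int, and 65535 * 65535 overflows a 32-bit int. */
    unsigned int c1 = (unsigned int)a * b;
    if (c1)
    {
        int c1h = c1 >> 16;
        int c1l = c1 & 0xffff;
        return (c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff;
    }
    return (1 - a - b) & 0xffff;
}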

This is my NEON conversion so far (the `??` lines mark the two multiplies I could not translate):

#define MULT_Z216_NEON(a, b, out) \ 
    temp = vorrq_u16 (*a, *b); \ 
    /* ?? NEON equivalent of _mm_mullo_epi16 */ \
    /* ?? NEON equivalent of _mm_mulhi_epu16 */ \
    *b = vsubq_u16(*out, *a); \ 
    *b = vceqq_u16(*out, vdupq_n_u16(0x0000)); \ 
    *b = vshrq_n_u16(*b, 15); \ 
    *out = vsubq_s16(*out, *a); \ 
    *a = vceqq_s16(*c, vdupq_n_u16(0x0000)); \ 
    *c = vaddq_s16(*c, *b); \ 
    *temp = vandq_u16(*temp, *a); \ 
    *out = vsubq_s16(*out, *a);

An SSE-optimized version of this operation has already been implemented as follows:

#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128((a), (b));            /* bitwise OR of a and b */ \
    (c) = _mm_mullo_epi16((a), (b));         /* low 16 bits of each 16x16 product */ \
    (a) = _mm_mulhi_epu16((a), (b));         /* high 16 bits of each unsigned 16x16 product */ \
    (b) = _mm_subs_epu16((c), (a));          /* unsigned saturating c - a: 0 where c <= a */ \
    (b) = _mm_cmpeq_epi16((b), C_0x0_XMM);   /* 0xFFFF where that difference is 0, else 0x0 */ \
    (b) = _mm_srli_epi16((b), 15);           /* logical shift right by 15 bits: 0xFFFF -> 1 */ \
    (c) = _mm_sub_epi16((c), (a));           /* wrapping c - a, i.e. low - high */ \
    (a) = _mm_cmpeq_epi16((c), C_0x0_XMM);   /* 0xFFFF where low == high, else 0x0 */ \
    (c) = _mm_add_epi16((c), (b));           /* add the 0/1 correction */ \
    t0  = _mm_and_si128(t0, (a));            /* keep a|b only where low == high */ \
    (c) = _mm_sub_epi16((c), t0);            /* subtract it from the result */

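I have almost converted this block to NEON intrinsics; I am only missing the NEON equivalents of _mm_mullo_epi16((a), (b)); and _mm_mulhi_epu16((a), (b));. Either I am misunderstanding something or NEON has no such intrinsics. If there is no direct equivalent, what is the best way to achieve these steps with NEON intrinsics?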

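UPDATE:

I forgot to emphasize the following point: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535). One answer proposes the intrinsic vqdmulhq_s16(). That does not match the given implementation, because this multiply intrinsic interprets the vector elements as signed values and therefore produces wrong output.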
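To make the signedness problem concrete, here is a minimal scalar sketch of a single lane. The helper names (`mulhi_u16`, `as_signed16`, `qdmulh_s16_model`) are hypothetical, and `qdmulh_s16_model` is only a model of what vqdmulhq_s16() does per lane (double the signed product, saturate, keep the high half), not the intrinsic itself.

```c
#include <stdint.h>

/* One lane of _mm_mulhi_epu16: high 16 bits of the unsigned 32-bit product. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* Reinterpret a 16-bit pattern as a two's-complement signed value. */
static int32_t as_signed16(uint16_t bits) {
    return bits >= 0x8000u ? (int32_t)bits - 0x10000 : (int32_t)bits;
}

/* Model of one lane of vqdmulhq_s16 (an assumption-labelled sketch):
   treat both inputs as signed, double the product, saturate to 32 bits,
   and return bits 31..16 of the result as a raw 16-bit pattern. */
static uint16_t qdmulh_s16_model(uint16_t a_bits, uint16_t b_bits) {
    int64_t doubled = 2 * (int64_t)as_signed16(a_bits) * as_signed16(b_bits);
    if (doubled > INT32_MAX) {
        doubled = INT32_MAX;  /* only reached for (-32768) * (-32768) */
    }
    return (uint16_t)((uint32_t)doubled >> 16);
}
```

With a = 40000 and b = 3 the unsigned high half is 1, while the signed model yields 0xFFFD, so no halving afterwards can recover the right value. For inputs below 32768, shifting the model's result right by one bit does match the unsigned high half.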


2012-07-02 Kami

If your values can be > 32767, you need the widening multiply suggested below (vmull_u16). If you know that all of your values are < 32768, you can use vqdmulhq_s16. – BitBank

You can use:

uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t)

It returns a vector of 32-bit products. If you want to split the result into high and low parts, you can use the NEON unzip intrinsics.
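The approach above can be sketched as follows, assuming a little-endian lane layout. The function name `mul_lo_hi_u16x8` and the array-based interface are hypothetical choices for illustration, and the `#else` branch is only a portable fallback so the sketch also compiles on machines without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Multiply 8 unsigned 16-bit lanes and store the low and high 16 bits
   of each 32-bit product separately. */
static void mul_lo_hi_u16x8(const uint16_t a[8], const uint16_t b[8],
                            uint16_t lo[8], uint16_t hi[8]) {
    uint16x8_t va = vld1q_u16(a);
    uint16x8_t vb = vld1q_u16(b);

    /* Widening multiplies: lanes 0..3 and lanes 4..7 as 32-bit products. */
    uint32x4_t p03 = vmull_u16(vget_low_u16(va), vget_low_u16(vb));
    uint32x4_t p47 = vmull_u16(vget_high_u16(va), vget_high_u16(vb));

    /* View every 32-bit product as two 16-bit halves and unzip them:
       val[0] collects the even halves (low words on little-endian),
       val[1] collects the odd halves (high words). */
    uint16x8x2_t halves = vuzpq_u16(vreinterpretq_u16_u32(p03),
                                    vreinterpretq_u16_u32(p47));
    vst1q_u16(lo, halves.val[0]);
    vst1q_u16(hi, halves.val[1]);
}
#else
/* Portable fallback with the same results, used when NEON is unavailable. */
static void mul_lo_hi_u16x8(const uint16_t a[8], const uint16_t b[8],
                            uint16_t lo[8], uint16_t hi[8]) {
    for (int i = 0; i < 8; ++i) {
        uint32_t p = (uint32_t)a[i] * (uint32_t)b[i];
        lo[i] = (uint16_t)p;
        hi[i] = (uint16_t)(p >> 16);
    }
}
#endif
```

Here `lo` corresponds to the result of _mm_mullo_epi16 and `hi` to the result of _mm_mulhi_epu16 for the same inputs.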


2012-07-02 18:30:50

That instruction is a 16x16 = 32 multiply (widening output). There are closer instructions (see my answer). – BitBank

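@BitBank: The OP needs both the upper 16 bits and the lower 16 bits, so he needs a 32-bit result. A doubling/saturating multiply is no substitute, because you lose precision. –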

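vmulq_s16() is the equivalent of _mm_mullo_epi16. There is no exact equivalent of _mm_mulhi_epu16; the closest instruction is vqdmulhq_s16(), which is "saturating, doubling, multiply, return high part". It operates only on signed 16-bit values, and you need to halve either the input or the output to cancel the doubling.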
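For full-range unsigned inputs, one possible port of the whole macro is sketched below. It is not the asker's or the answerer's code: it takes the low half from vmulq_u16 (the same bits as vmulq_s16) and, as an assumption of this sketch, builds the high half from the widening vmull_u16 followed by a narrowing shift with vshrn_n_u32 rather than from vqdmulhq_s16. The names `mult_z216_ref`, `mult_z216_neon` and `mult_z216_x8` are hypothetical, and the `#else` branch is a scalar fallback for machines without NEON.

```c
#include <stdint.h>

/* Scalar reference, following the question's function but with the
   product computed in unsigned arithmetic. */
static uint16_t mult_z216_ref(uint16_t a, uint16_t b) {
    uint32_t c1 = (uint32_t)a * (uint32_t)b;
    if (c1) {
        uint32_t hi = c1 >> 16, lo = c1 & 0xffff;
        return (uint16_t)((lo - hi + (lo < hi ? 1u : 0u)) & 0xffff);
    }
    return (uint16_t)((1u - a - b) & 0xffff);
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Step-by-step port of MULT_Z216_SSE for 8 unsigned 16-bit lanes. */
static uint16x8_t mult_z216_neon(uint16x8_t a, uint16x8_t b) {
    uint16x8_t zero = vdupq_n_u16(0);
    uint16x8_t t0 = vorrq_u16(a, b);
    uint16x8_t lo = vmulq_u16(a, b);            /* like _mm_mullo_epi16 */
    uint16x8_t hi = vcombine_u16(               /* like _mm_mulhi_epu16 */
        vshrn_n_u32(vmull_u16(vget_low_u16(a), vget_low_u16(b)), 16),
        vshrn_n_u32(vmull_u16(vget_high_u16(a), vget_high_u16(b)), 16));
    /* 1 where lo <= hi (saturating difference is zero), else 0. */
    uint16x8_t carry = vshrq_n_u16(vceqq_u16(vqsubq_u16(lo, hi), zero), 15);
    uint16x8_t c = vsubq_u16(lo, hi);           /* wrapping lo - hi */
    uint16x8_t is_zero = vceqq_u16(c, zero);    /* 0xFFFF where lo == hi */
    c = vaddq_u16(c, carry);
    return vsubq_u16(c, vandq_u16(t0, is_zero));
}

static void mult_z216_x8(const uint16_t a[8], const uint16_t b[8],
                         uint16_t out[8]) {
    vst1q_u16(out, mult_z216_neon(vld1q_u16(a), vld1q_u16(b)));
}
#else
/* Scalar fallback so the sketch builds without NEON. */
static void mult_z216_x8(const uint16_t a[8], const uint16_t b[8],
                         uint16_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        out[i] = mult_z216_ref(a[i], b[i]);
    }
}
#endif
```

Comparing `mult_z216_x8` against `mult_z216_ref` lane by lane, including inputs of 0 and values above 32767, is a quick way to check such a port on the target hardware.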


2012-07-02 22:02:13 BitBank

Since vqdmulhq_s16() takes signed inputs, GCC complains about wrong argument types... How can I convert from uint16x8_t to int16x8_t efficiently? – Kami

There are casting macros; use vreinterpretq_s16_u16() – BitBank

Please see my edit about signed multiplication! – Kami
