快速計算__m128i寄存器中的設定位數

我應該計算__m128i寄存器的設定位數。特別是，我應該使用以下方法編寫兩個可以計算寄存器位數的函數。快速計算__m128i寄存器中的設定位數

寄存器的設定位總數。
寄存器每個字節的設置位數。

是否有內部函數可以完全或部分地執行上述操作？

2013-06-27 enzom83

最新的CPU還一個'POPCNT'（人口數）指令; GCC通過內置的['__builtin_popcount']（http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html）公開它。 –

請參閱http://graphics.stanford.edu/~seander/bithacks.html瞭解更多信息。 –

MS也有popcount函數...請參閱http://stackoverflow.com/questions/11114017/whats-the-difference-between-popcnt-and-mm-popcnt-u32 ...請注意，這些不一定比咬人;如果對數組中的位進行計數，某些bithack函數的速度會更快。 –

以下是我在舊項目中使用的一些代碼（there is a research paper about it）。下面的函數popcnt8計算每個字節中設置的位數。

僅SSE2版本（基於算法3在Hacker's Delight book）：

static const __m128i popcount_mask1 = _mm_set1_epi8(0x77); 
static const __m128i popcount_mask2 = _mm_set1_epi8(0x0F); 
static inline __m128i popcnt8(__m128i x) { 
    __m128i n; 
    // Count bits in each 4-bit field. 
    n = _mm_srli_epi64(x, 1); 
    n = _mm_and_si128(popcount_mask1, n); 
    x = _mm_sub_epi8(x, n); 
    n = _mm_srli_epi64(n, 1); 
    n = _mm_and_si128(popcount_mask1, n); 
    x = _mm_sub_epi8(x, n); 
    n = _mm_srli_epi64(n, 1); 
    n = _mm_and_si128(popcount_mask1, n); 
    x = _mm_sub_epi8(x, n); 
    x = _mm_add_epi8(x, _mm_srli_epi16(x, 4)); 
    x = _mm_and_si128(popcount_mask2, x); 
    return x; 
}

SSSE3版本（由於Wojciech Mula）：

static const __m128i popcount_mask = _mm_set1_epi8(0x0F); 
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4); 
static inline __m128i popcnt8(__m128i n) { 
    const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(n, popcount_mask)); 
    const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(n, 4), popcount_mask)); 
    return _mm_add_epi8(pcnt0, pcnt1); 
}

XOP版本（相當於SSSE3，但使用XOP指令這在AMD推土機上更快）

static const __m128i popcount_mask = _mm_set1_epi8(0x0F); 
static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4); 
static const __m128i popcount_shift = _mm_set1_epi8(-4); 
static inline __m128i popcount8(__m128i n) { 
    const __m128i pcnt0 = _mm_perm_epi8(popcount_table, popcount_table, _mm_and_si128(n, popcount_mask)); 
    const __m128i pcnt1 = _mm_perm_epi8(popcount_table, popcount_table, _mm_shl_epi8(n, popcount_shift)); 
    return _mm_add_epi8(pcnt0, pcnt1); 
}

功能下面離子popcnt64計數比特在SSE的低和高的64位部分的數目寄存器：

SSE2版本：

static inline __m128i popcnt64(__m128i n) { 
    const __m128i cnt8 = popcnt8(n); 
    return _mm_sad_epu8(cnt8, _mm_setzero_si128()); 
}

XOP版本：

static inline __m128i popcnt64(__m128i n) { 
    const __m128i cnt8 = popcnt8(n); 
    return _mm_haddq_epi8(cnt8); 
}

最後，函數popcnt128以下統計整個128位寄存器的位數：

static inline int popcnt128(__m128i n) { 
    const __m128i cnt64 = popcnt64(n); 
    const __m128i cnt64_hi = _mm_unpackhi_epi64(cnt64, cnt64); 
    const __m128i cnt128 = _mm_add_epi32(cnt64, cnt64_hi); 
    return _mm_cvtsi128_si32(cnt128); 
}

然而，爲了實現popcnt128更有效的方式是使用硬件POPCNT指令（對處理器，所以支持它）：

static inline int popcnt128(__m128i n) { 
    const __m128i n_hi = _mm_unpackhi_epi64(n, n); 
    #ifdef _MSC_VER 
     return __popcnt64(_mm_cvtsi128_si64(n)) + __popcnt64(_mm_cvtsi128_si64(n_hi)); 
    #else 
     return __popcntq(_mm_cvtsi128_si64(n)) + __popcntq(_mm_cvtsi128_si64(n_hi)); 
    #endif 
}

來源

2013-06-28 00:21:48

看起來像你是所提及的研究論文的共同作者之一:-)對於' n'paste船員以及。您的解決方案是最新的。哈克特技不再是最新的。榮譽，夥計！ –

哦，太糟糕了。您在ACM上發表了論文，因此我無法在不付15美元的情況下閱讀它:-( –

@NilsPipenbrinck，論文在會議網站上免費提供：meetings.computer.org/sc/2012/papers/1000a033。 pdf –

-2

編輯：我想我不明白OP在找什麼，但我保持我的回答，以防其他人在這個問題上有用。

C提供了一些很好的按位操作。

下面是代碼來計數的整數設置的位數：

countBitsSet(int toCount) 
{ 
    int numBitsSet = 0; 
    while(toCount != 0) 
    { 
     count += toCount % 2; 
     toCount = toCount >> 1; 
    } 
    return numBitsSet; 
}

說明：

toCount % 2

返回我們的整數的最後一位。（除以二並檢查餘數）。我們將其添加到總計數中，然後將我們的toCount值的位移1。此操作應持續進行，直到toCount中沒有更多位（當toCount等於0時）

要計算特定字節中的位數，您需要使用掩碼。這裏有一個例子：

countBitsInByte(int toCount, int byteNumber) 
{ 
    int mask = 0x000F << byteNumber * 8 
    return countBitsSet(toCount & mask) 
}

比方說，在我們的系統，我們認爲字節0的小端排序系統中至少顯著字節。我們想通過掩蓋設置爲0的位來創建一個新的toCount來傳遞給我們之前的countBitsSet函數。我們通過將一個滿字節的字節（用字母F表示）移動到我們想要的位置（byteNumber * 8代表一個字節中的8位），並使用我們的toCount變量執行按位與運算。

來源

2013-06-28 00:11:56

*有*內置（映射到CPU指令的內部函數，例如「POPCNT」），問題是關於在128位SSE（XMM）寄存器中計數置位，而不是在'int'中。 –

啊，我看到我並沒有完全理解這個問題。如果合適的話，我會編輯我的回覆並保持它的狀態，以防有人遇到這種情況。 –

C不提供「很好」的按位操作。你甚至不能進行算術右移！允許實現爲2的補碼，但是在簽名類型上有'>>'是邏輯轉換。但在實踐中，所有編譯人員實際上都希望使用這些編譯器，讓您在簽名類型上進行算術右移，因此，您的函數是負數「toCount」的無限循環。並且簽名的'％2'比'＆1'需要更多的工作，因爲它必須爲負的奇數產生'-1'。但是（在普通編譯器上），如果'toCount'爲負值，那麼函數永遠不會返回，所以問題隱藏起來...... –

正如在第一評論說，海灣合作委員會3.4+提供了一個易於訪問（希望最佳）通過

int __builtin_popcount (unsigned int x) /* Returns the number of 1-bits in x. */

內置正如這裏所說： http://gcc.gnu.org/onlinedocs/gcc-3.4.3/gcc/Other-Builtins.html#Other%20Builtins

並不完全回答128位的問題，而是給一個不錯的答案，我已經當我降落在這裏的問題:)

來源

2014-04-29 15:19:14 yota

這裏是關於Bit Twiddling Hacks - Counting Set Bits in Parallel一個版本鹼與命名類似的其他固有功能以及一些附加功能16個32和64位矢量

#include "immintrin.h" 

/* bit masks: 0x55 = 01010101, 0x33 = 00110011, 0x0f = 00001111 */ 
static const __m128i m1 = {0x5555555555555555ULL,0x5555555555555555ULL}; 
static const __m128i m2 = {0x3333333333333333ULL,0x3333333333333333ULL}; 
static const __m128i m3 = {0x0f0f0f0f0f0f0f0fULL,0x0f0f0f0f0f0f0f0fULL}; 
static const __m128i m4 = {0x001f001f001f001fULL,0x001f001f001f001fULL}; 
static const __m128i m5 = {0x0000003f0000003fULL,0x0000003f0000003fULL}; 

__m128i _mm_popcnt_epi8(__m128i x) { 
    /* Note: if we returned x here it would be like _mm_popcnt_epi1(x) */ 
    __m128i y; 
    /* add even and odd bits*/ 
    y = _mm_srli_epi64(x,1); //put even bits in odd place 
    y = _mm_and_si128(y,m1); //mask out the even bits (0x55) 
    x = _mm_subs_epu8(x,y); //shortcut to mask even bits and add 
    /* if we just returned x here it would be like _mm_popcnt_epi2(x) */ 
    /* now add the half nibbles */ 
    y = _mm_srli_epi64 (x,2); //move half nibbles in place to add 
    y = _mm_and_si128(y,m2); //mask off the extra half nibbles (0x0f) 
    x = _mm_and_si128(x,m2); //ditto 
    x = _mm_adds_epu8(x,y); //totals are a maximum of 5 bits (0x1f) 
    /* if we just returned x here it would be like _mm_popcnt_epi4(x) */ 
    /* now add the nibbles */ 
    y = _mm_srli_epi64(x,4); //move nibbles in place to add 
    x = _mm_adds_epu8(x,y); //totals are a maximum of 6 bits (0x3f) 
    x = _mm_and_si128(x,m3); //mask off the extra bits 
    return x; 
} 

__m128i _mm_popcnt_epi16(__m128i x) { 
    __m128i y; 
    x = _mm_popcnt_epi8(x); //get byte popcount 
    y = _mm_srli_si128(x,1); //copy even bytes for adding 
    x = _mm_add_epi16(x,y); //add even bytes into the odd bytes 
    return _mm_and_si128(x,m4);//mask off the even byte and return 
} 

__m128i _mm_popcnt_epi32(__m128i x) { 
    __m128i y; 
    x = _mm_popcnt_epi16(x); //get word popcount 
    y = _mm_srli_si128(x,2); //copy even words for adding 
    x = _mm_add_epi32(x,y); //add even words into odd words 
    return _mm_and_si128(x,m5);//mask off the even words and return 
} 

__m128i _mm_popcnt_epi64(__m128i x){ 
    /* _mm_sad_epu8() is weird 
     It takes the absolute difference of bytes between 2 __m128i 
     then horizontal adds the lower and upper 8 differences 
     and stores the sums in the lower and upper 64 bits 
    */ 
    return _mm_sad_epu8(_mm_popcnt_epi8(x),(__m128i){0}); 
} 

int _mm_popcnt_si128(__m128i x){ 
    x = _mm_popcnt_epi64(x); 
    __m128i y = _mm_srli_si128(x,8); 
    return _mm_add_epi64(x,y)[0]; 
    //alternative: __builtin_popcntll(x[0])+__builtin_popcntll(x[1]); 
}

來源

2018-02-01 17:04:22 technosaurus

爲什麼你需要在最初的步驟之後爲飽和的'adds'而不是常規的'add'？（儘管根據Agner Fog的指令表，「paddusb ''和'paddb'具有相同的性能，所以沒有足夠的理由來避免飽和添加，這真是令人驚訝。） –

快速計算__m128i寄存器中的設定位數

回答

相關問題