位技巧找到0的數量等於1的數量的第一個位置

找到最左邊位的索引i的最快方法是什麼，使得最左邊的i位中的0的數量等於最左邊的i位中的1的數量？我在想一些像here提到的一些技巧。

我對最近的x86_64處理器感興趣。這可能與某些處理器支持指令如POPCNT（計數1的數量）或LZCNT（計數前導0的數量）相關。

如果有幫助，可以假設第一位總是有一定的值。

實施例（有16位）：如果整數是

1110010100110110b 
     ^
     i

然後I = 10和它對應的標記的位置。

的可能（慢）實現16位整數可能是：

mask = 1000000000000000b 
pos = 0 
count=0 
do { 
    if(x & mask) 
     count++; 
    else 
     count--; 

    pos++; 
    x<<=1; 
} while(count) 

return pos;

編輯：在代碼中的bug固定按@njuffa評論。

來源

2016-12-02 Steven

而'i = 0'會被禁止，對不對？否則，它有點無聊 – harold

是的確的:)我必須在範圍1，...，大小其中大小是整數的位數。 – Steven

我不清楚規格。你能提供一個簡單的（慢）參考實現嗎？在我看來，這樣一個最左邊的位的位置不能總是被找到，一個簡單的32位示例是0xFFFFFFFE（或者結果是在這種情況下是32） – njuffa

我對此沒有任何技巧，但我確實有SIMD技巧。

一是一些意見，

解釋0爲-1，這個問題變得「找到第一個i，使得第一i位之和爲0」。
0是偶數，但所有位在這種解釋下都有奇數值，這給出了一個洞察，即i必須是偶數，並且這個問題可以通過2位塊來分析。
01和10不改變餘額。

擴展的2組出爲（無下面的測試）字節，

// optionally use AVX2 _mm_srlv_epi32 instead of ugly variable set 
__m128i spread = _mm_shuffle_epi8(_mm_setr_epi32(x, x >> 2, x >> 4, x >> 6), 
        _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15)); 
spread = _mm_and_si128(spread, _mm_set1_epi8(3));

-1 1，並且01和10由0替換00，11後：

__m128i r = _mm_shuffle_epi8(_mm_setr_epi8(-1, 0, 0, 1, 0,0,0,0,0,0,0,0,0,0,0,0), 
          spread);

計算前綴和：

__m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1)); 
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2)); 
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4)); 
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));

找到最高0：

__m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128()); 
return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;

的<< 15和*2出現，因爲所得到的掩碼是16個比特，但CLZ是32位，它的移動一個較小，因爲如果頂部字節是零，指示1組的2取，不爲零。

來源

2016-12-02 14:24:15 harold

謝謝你的回答。我完全理解了解決方案的想法，但我仍然不確定指令在做什麼（並且我發現的文檔有點模糊）。你介意解釋還是指點我一個很好的參考？ – Steven

@Steven除了'pshufb'（'_mm_shuffle_epi8'）之外，他們中的大多數看起來相當無辜，那是什麼問題？ – harold

是的，我也不明白你爲什麼用_mm_setr_epi8調用的參數掩蓋。 – Steven

可能的解決方案（對於32位整數）。我不確定是否可以改進/避免使用查找表。這裏x是輸入整數。

//Look-up table of 2^16 elements. 
//The y-th is associated with the first 2 bytes y of x. 
//If the wanted bit is in y, LUT1[y] is minus the position of the bit 
//If the wanted bit is not in y, LUT1[y] is the number of ones in excess in y minus 1 (between 0 and 15) 
LUT1 = .... 

//Look-up talbe of 16 * 2^16 elements. 
//The y-th element is associated to two integers y' and y'' of 4 and 16 bits, respectively. 
//y' is the number of excess ones in the first byte of x, minus 1 
//y'' is the second byte of x. The table contains the answer to return. 
LUT2 = .... 

if(LUT1[x>>16] < 0) 
    return -LUT1[x>>16]; 

return LUT2[ (LUT1[x>>16]<<16) | (x & 0xFFFF) ]

這需要大約1MB的查找表。同樣的想法也可以使用4個查找表（x每個字節一個）。這需要更多的操作，但將內存降低到12KB。

LUT1 = ... //2^8 elements 
LUT2 = ... //8 * 2^8 elements 
LUT3 = ... //16 * 2^8 elements 
LUT3 = ... //24 * 2^8 elements 

y = x>>24 
if(LUT1[y] < 0) 
    return -LUT1[y]; 

y = (LUT1[y]<<8) | ((x>>16) & 0xFF); 
if(LUT2[y] < 0) 
    return -LUT2[y]; 

y = (LUT2[y]<<8) | ((x>>8) & 0xFF); 
if(LUT3[y] < 0) 
    return -LUT3[y]; 

return LUT4[(LUT2[y]<<8) | (x & 0xFF) ];

來源

2016-12-02 14:27:48 Steven

我通過測量計算所有32位整數結果所需的總時間，以4個查找表與@ harold的解決方案爲基準對解決方案進行了基準測試。在我的機器上，這個解決方案需要約5600000個時鐘滴答，而harold需要約25000000個時鐘滴答（每個C的時鐘（）函數）。 – Steven

您可以使用'perf stat'輕鬆測量核心時鐘週期中的整個程序，而不是實時。如果您沒有內存瓶頸，則可以從等式中消除CPU頻率調整（speedstep/turbo）。（但是Intel CPU上的專用每核心二級緩存只有256kB，所以情況可能並非如此）。順便說一句，閱讀5.6M與25M相比更容易。此外，它可能會影響您測試的硬件以及是否使用AVX。一般來說編譯器選項很重要。 –

但是，微型基準標記使得2^16條目LUT外觀*方式比在真實程序中更好。在microbenchmark中，整個事物在L2中保持高溫（並且您使用的部分可能會在L1中保持熱點，這取決於輸入）。在一個真正的程序中，其他代碼會在調用此函數之間驅逐LUT。此外，您沒有說明您的微基準測試是否會測試延遲或吞吐量。 –

這是一個解決方案，使用傳統的位扭曲技術的32位數據。中間計算需要64位算術和邏輯運算。儘可能地嘗試堅持便攜式操作。要求是POSIX函數ffsll的一個實現，用於在64位long long中找到最低有效1位，以及一個自定義函數rev_bit_duos，該函數在32位整數中反轉位二進制。後者可以替換爲平臺特定的位反轉內部函數，例如ARM平臺上的__rbit intrinsic。

基本的觀察結果是，如果可以提取具有相同數量的0位和1位的位組，它必須包含偶數個位。這意味着我們可以檢查2位組中的操作數。我們可以進一步限制自己跟蹤每個2位增加（0b11），減少（0b00）還是保持不變（0b01,0b10）位的運行平衡。如果我們用單獨的計數器計數正負值變化，4位計數器就足夠了，除非輸入是0或0xffffffff，可以單獨處理。根據對問題的評論，這些情況不應該發生。通過從每個2位組的正變化計數中減去負變化計數，我們可以找到哪個組中的餘額變爲零。可能有多個這樣的位組，我們需要找到第一個。

處理可以並行expanding each 2-bit group into a nibble，然後可以作爲更改計數器。前綴和可以通過整數乘以適當的常數來計算，其提供了在每個半字節位置處的必要的移位增加操作。並行半字節減法的有效方法是衆所周知的，同樣存在衆所周知的technique due to Alan Mycroft for detecting zero-bytes，其可以輕微地改變爲零半字節檢測。然後應用POSIX函數ffsll來查找該半字節的位位置。

輕微問題是一個最左邊的位組的提取需求，而不是最右邊的，因爲艾倫Mycroft的伎倆只能尋找從權第一零四位。另外，處理最左邊位組的前綴和需要使用可能不容易獲得的mulhi操作，並且可能比標準整數乘法效率低。我已經解決了這兩個問題，只需簡單地對原始操作數進行位反轉即可。

#include <stdio.h> 
#include <stdlib.h> 
#include <stdint.h> 
#include <string.h> 

/* Reverse bit-duos using classic binary partitioning algorithm */ 
inline uint32_t rev_bit_duos (uint32_t a) 
{ 
    uint32_t m; 
    a = (a >> 16) | (a << 16);       // swap halfwords 
    m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes 
    m = (m << 4)^m; a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles 
    m = (m << 2)^m; a = ((a >> 2) & m) | ((a << 2) & ~m); // swap bit-duos 
    return a; 
} 

/* Return the number of most significant (leftmost) bits that must be extracted 
    to achieve an equal count of 1-bits and 0-bits in the extracted bit group. 
    Return 0 if no such bit group exists. 
*/ 
int solution (uint32_t x) 
{ 
    const uint64_t mask16 = 0x0000ffff0000ffffULL; // alternate half-words 
    const uint64_t mask8 = 0x00ff00ff00ff00ffULL; // alternate bytes 
    const uint64_t mask4h = 0x0c0c0c0c0c0c0c0cULL; // alternate nibbles, high bit-duo 
    const uint64_t mask4l = 0x0303030303030303ULL; // alternate nibbles, low bit-duo 
    const uint64_t nibble_lsb = 0x1111111111111111ULL; 
    const uint64_t nibble_msb = 0x8888888888888888ULL; 
    uint64_t a, b, r, s, t, expx, pc_expx, nc_expx; 
    int res; 

    /* common path can't handle all 0s and all 1s due to counter overflow */ 
    if ((x == 0) || (x == ~0)) return 0; 

    /* make zero-nibble detection work, and simplify prefix sum computation */ 
    x = rev_bit_duos (x); // reverse bit-duos 

    /* expand each bit-duo into a nibble */ 
    expx = x; 
    expx = ((expx << 16) | expx) & mask16; 
    expx = ((expx << 8) | expx) & mask8; 
    expx = ((expx << 4) | expx); 
    expx = ((expx & mask4h) * 4) + (expx & mask4l); 

    /* compute positive and negative change counts for each nibble */ 
    pc_expx = expx & (expx >> 1) & nibble_lsb; 
    nc_expx = ~expx & (~expx >> 1) & nibble_lsb; 

    /* produce prefix sums for positive and negative change counters */ 
    a = pc_expx * nibble_lsb; 
    b = nc_expx * nibble_lsb; 

    /* subtract positive and negative prefix sums, nibble-wise */ 
    s = a^~b; 
    r = a | nibble_msb; 
    t = b & ~nibble_msb; 
    s = s & nibble_msb; 
    r = r - t; 
    r = r^s; 

    /* find first nibble that is zero using Alan Mycroft's magic */ 
    r = (r - nibble_lsb) & (~r & nibble_msb); 
    res = ffsll (r)/2; // account for bit-duo to nibble expansion 

    return res; 
} 

/* Return the number of most significant (leftmost) bits that must be extracted 
    to achieve an equal count of 1-bits and 0-bits in the extracted bit group. 
    Return 0 if no such bit group exists. 
*/ 
int reference (uint32_t x) 
{ 
    int count = 0; 
    int bits = 0; 
    uint32_t mask = 0x80000000; 
    do { 
     bits++; 
     if (x & mask) { 
      count++; 
     } else { 
      count--; 
     } 
     x = x << 1; 
    } while ((count) && (bits <= (int)(sizeof(x) * CHAR_BIT))); 
    return (count) ? 0 : bits; 
} 

int main (void) 
{ 
    uint32_t x = 0; 
    do { 
     uint32_t ref = reference (x); 
     uint32_t res = solution (x); 
     if (res != ref) { 
      printf ("x=%08x res=%u ref=%u\n\n", x, res, ref); 
     } 
     x++; 
    } while (x); 
    return EXIT_SUCCESS; 
}

來源

2016-12-02 23:02:37 njuffa

我還沒有找到你的全部算法。如果你有一個尾隨零計數，也就是從另一端算起的「ffsll」，你能避免位反轉嗎？（包括x86在內的許多平臺在asm中都有這樣的功能，但我忘記了它是否具有可移植的C函數。）或者，您是否需要位才能在multiplis中進行傳播？ –

零啃檢測的Mycroft技術*可以標記多個零半字節，但由於可能的進位傳播，它只能以100％的準確度標記最右（最低有效）的零半字節。我最初忘記了這一點，直接使用輸入（沒有初始反轉），使用'clz'從左邊（最高有效位）搜索零位，只發現我的代碼失敗了很多參數> 0xc0000000' 。另外，非反轉處理需要64位'mulhi'來生成前綴和，這通常必須用內聯或內聯彙編進行編碼。 – njuffa

位技巧找到0的數量等於1的數量的第一個位置

回答

相關問題