組合布隆過濾器

我使用布隆過濾器來檢查集合中的重複數據。但是，需要將兩組數據的結果組合成一個過濾器，以檢查兩組數據之間是否存在重複。我設計的僞Python的一個函數來執行此任務：組合布隆過濾器

def combine(a : bloom_filter, b : bloom_filter): 
    assert a.length == b.length 
    assert a.hashes == b.hashes 

    c = new bloom_filter(length = a.length, hashes = b.hashes) 
    c.attempts = a.attempts + b.attempts 
    c.bits = a.bits | b.bits 

    # Determining the amount of items 
    a_and_b = count(a & b) 
    a_not_b = count(a & !b) 
    not_a_b = count(!a & b) 
    neither = count(!a & !b) 
    c.item_count = a_not_b/a.length * a.item_count 
       + not_a_b/b.length * b.item_count 
       + a_and_b/c.length * min(a.item_count, b.item_count) 

    return c

這是否聽起來甚至正確？關於是否甚至有可能做我想做的事情，我有很大的內部爭議，因爲關於源數據的大部分信息都丟失了（這是布隆過濾器的要點）。

來源

2011-05-23 Travis Gockel

可以推導出公式用於估計物品的量的布隆過濾器：

c = log(z/N)/((h * log(1 - 1/N)) 

N: Number of bits in the bit vector 
h: Number of hashes 
z: Number of zero bits in the bit vector

這提供在布隆過濾器的項目數的相當準確的估計。你可以用簡單的減法來估計貢獻。

來源

2011-06-05 18:48:55

，便有可能.....那種..

可以說，設置一個包含蘋果和桔子

可以說集B含有豌豆和胡蘿蔔

構建一個簡單的16位布隆過濾器作爲一個例子和CRC32作爲哈希

crc32(apples) = 0x70CCB02F 

crc32(oranges) = 0x45CDF3B4 

crc32(peas) = 0xB18D0C2B 

crc32(carrots) = 0x676A9E28

開始瓦特/空布隆過濾器（BF）（比方說16個比特）爲兩組（A，B）

BFA = BFB = 0000 0000 0000 0000

然後，把hash分成一些比特長度，我們將在這裏使用4這裏我們可以給BF添加蘋果。例如

Get Apples BF Index list by splitting up the hash: 

0x70CCB02F = 0111 0000 1100 1100 1011 0000 0010 1111 
      7  0 C C B  0 2  F  
---------------------------------------------------- 

Add Apples to BFA by setting BF bit indexes [ 7, 0, 12, 12, 11, 0, 2, 15] 

           (set the index bit of an empty BF to 1) 
Apples =  1001 1000 1000 0101 (<- see indexes 0,2,7,11,12,15 are set) 
BF =   0000 0000 0000 0000 (or operation adds that item to the BF) 
================================ 
Updated BFA = 1001 1000 1000 0101

加入橙子至bf同樣的方式：

0x45CDF3B4 = 0100 0101 1100 1101 1111 0011 1011 0100 
       4 5 12 13 15 3 11 4 
---------------------------------------------------- 
Add oranges to BF by setting BF bit indexes [ 4,5,12,13,15,3,11,4] 

Oranges =  1011 1000 0011 1000 
BFA =   1001 1000 1000 0101 (or operation) 
================================ 
Updated BFA = 1011 1000 1011 1101

所以，現在蘋果和桔子插入BF1 W的1011 1000 1011 1101

/終值執行相同的BFB

crc32(peas) = 0xB18D0C2B becomes => 
set [11,2,12,0,13,1,8] in BFB 
0011 1001 0000 0011 = BF(peas) 

crc32(carrots) = 0x676A9E28 becomes => 
set [8,2,14,9,10,6,7] in BFB 

0100 0111 1100 0100 = BF(carrots) 

so BFB = 
0011 1001 0000 0011 BF(peas) 
0100 0111 1100 0100 BF(carrots) 
=================== ('add' them to BFB via locial or op) 
0111 1111 1100 0111

您現在可以在循環中搜索B中的條目，也可以搜索B中的副詞：

確實乙包含「桔子」 =>

1011 1000 0011 1000 (Oranges BF representation) 
0111 1111 1100 0111 (BFB) 
=====================  (and operation) 
0011 1000 0000 0000

因爲這個結果(0011 1000 0000 0000)不匹配橙子的原BF，你可以肯定的是，B不包含任何橘子

......（對於其餘物品）

以下，B不包含任何A項，就像B不包含任何蘋果一樣。

我不認爲這就是你問的，但看起來你可以計算機上的差異高爐，這是更重要的。好像你可以做一個異或運算，並且會給你同時包含差異的「單」數組：

0111 1111 1100 0111 (BFB) 
1011 1000 1011 1101 (BFA) 
======================== 
1100 0111 0111 1010 (BFA xor BFB) == (items in B not in A, and items in A not in B)

與此單BF的意思，你可以檢測項目的非腦幹的時間100％，只是不存在該項目的100％。

你會使用它的方式，如下（檢查是否豆豆「從A丟失）：

1100 0111 0111 1010 (BFA xor BFB) 
0011 1001 0000 0011 (Peas) 
============================== (And operation) 
0000 0001 0000 0010 (non-zero)

因爲(BFA xor BFB) && (Peas) != 0你知道一集不包含‘豌豆’......

再次，你會逐項測試，也許你可以做聚合，但可能不是一個好主意......

希望這有助於！

來源

2012-11-15 21:49:55

組合布隆過濾器

回答

相關問題