2012-11-09 92 views
15

我需要(也找不到)純粹的python(無C++)實現MurmurHash,我太新手寫自己了。速度或內存使用量與我的項目無關。是否有MurmurHash的純Python實現?

我找到了一個嘗試here,但它僅限於31位散列,我真的需要64位散列。

注:對於那些誰需要一個快速的實現,有一個MurmurHash2庫here和MurmurHash3庫here

+0

爲什麼你想要一個純粹的Python實現? – 2012-11-09 09:42:39

+4

我需要一個純Python實現,因爲我的應用程序需要在不能執行非python代碼的平臺上運行(如Google App Engine)。 – greg

+0

還有這個,但我認爲你發現的mmh3看起來更好看。 https://code.google.com/p/pyfasthash/ –

回答

7

這是未經測試(對不起!),但這裏是我想出了一個版本。 Python允許任意大的整數,所以我爲前8個字節(或64位)創建了一個掩碼,然後對所有可能產生大於64位整數的算術結果應用(通過按位與)。 也許別人可以在一般的方法和可能出現的問題與字節序,評論等

def bytes_to_long(bytes): 
    assert len(bytes) == 8 
    return sum((b << (k * 8) for k, b in enumerate(bytes))) 


def murmur(data, seed): 

    m = 0xc6a4a7935bd1e995 
    r = 47 

    MASK = 2 ** 64 - 1 

    data_as_bytes = bytearray(data) 

    h = seed^((m * len(data_as_bytes)) & MASK) 

    for ll in range(0, len(data_as_bytes), 8): 
     k = bytes_to_long(data_as_bytes[ll:ll + 8]) 
     k = (k * m) & MASK 
     k = k^((k >> r) & MASK) 
     k = (k * m) & MASK 
     h = (h^k) 
     h = (h * m) & MASK 

    l = len(data_as_bytes) & 7 

    if l >= 7: 
     h = (h^(data_as_bytes[6] << 48)) 

    if l >= 6: 
     h = (h^(data_as_bytes[5] << 40)) 

    if l >= 5: 
     h = (h^(data_as_bytes[4] << 32)) 

    if l >= 4: 
     h = (h^(data_as_bytes[3] << 24)) 

    if l >= 3: 
     h = (h^(data_as_bytes[4] << 16)) 

    if l >= 2: 
     h = (h^(data_as_bytes[4] << 8)) 

    if l >= 1: 
     h = (h^data_as_bytes[4]) 
     h = (h * m) & MASK 

    h = h^((h >> r) & MASK) 
    h = (h * m) & MASK 
    h = h^((h >> r) & MASK) 

    return h 
5

這裏去MurmurHash3 32的純Python實現,它只是哈希字符串,但可以很容易地適應採取字節數組來代替。這是算法的Java version中的一個端口。

def murmur3_x86_32(data, seed = 0): 
    c1 = 0xcc9e2d51 
    c2 = 0x1b873593 

    length = len(data) 
    h1 = seed 
    roundedEnd = (length & 0xfffffffc) # round down to 4 byte block 
    for i in range(0, roundedEnd, 4): 
     # little endian load order 
     k1 = (ord(data[i]) & 0xff) | ((ord(data[i + 1]) & 0xff) << 8) | \ 
      ((ord(data[i + 2]) & 0xff) << 16) | (ord(data[i + 3]) << 24) 
     k1 *= c1 
     k1 = (k1 << 15) | ((k1 & 0xffffffff) >> 17) # ROTL32(k1,15) 
     k1 *= c2 

     h1 ^= k1 
     h1 = (h1 << 13) | ((h1 & 0xffffffff) >> 19) # ROTL32(h1,13) 
     h1 = h1 * 5 + 0xe6546b64 

    # tail 
    k1 = 0 

    val = length & 0x03 
    if val == 3: 
     k1 = (ord(data[roundedEnd + 2]) & 0xff) << 16 
    # fallthrough 
    if val in [2, 3]: 
     k1 |= (ord(data[roundedEnd + 1]) & 0xff) << 8 
    # fallthrough 
    if val in [1, 2, 3]: 
     k1 |= ord(data[roundedEnd]) & 0xff 
     k1 *= c1 
     k1 = (k1 << 15) | ((k1 & 0xffffffff) >> 17) # ROTL32(k1,15) 
     k1 *= c2 
     h1 ^= k1 

    # finalization 
    h1 ^= length 

    # fmix(h1) 
    h1 ^= ((h1 & 0xffffffff) >> 16) 
    h1 *= 0x85ebca6b 
    h1 ^= ((h1 & 0xffffffff) >> 13) 
    h1 *= 0xc2b2ae35 
    h1 ^= ((h1 & 0xffffffff) >> 16) 

    return h1 & 0xffffffff 
2

修正錯誤在尼古拉的版本

def bytes_to_long(bytes): 
    assert len(bytes) == 8 
    return sum((b << (k * 8) for k, b in enumerate(bytes))) 


def murmur64(data, seed = 19820125): 

    m = 0xc6a4a7935bd1e995 
    r = 47 

    MASK = 2 ** 64 - 1 

    data_as_bytes = bytearray(data) 

    h = seed^((m * len(data_as_bytes)) & MASK) 

    off = len(data_as_bytes)/8*8 
    for ll in range(0, off, 8): 
     k = bytes_to_long(data_as_bytes[ll:ll + 8]) 
     k = (k * m) & MASK 
     k = k^((k >> r) & MASK) 
     k = (k * m) & MASK 
     h = (h^k) 
     h = (h * m) & MASK 

    l = len(data_as_bytes) & 7 

    if l >= 7: 
     h = (h^(data_as_bytes[off+6] << 48)) 

    if l >= 6: 
     h = (h^(data_as_bytes[off+5] << 40)) 

    if l >= 5: 
     h = (h^(data_as_bytes[off+4] << 32)) 

    if l >= 4: 
     h = (h^(data_as_bytes[off+3] << 24)) 

    if l >= 3: 
     h = (h^(data_as_bytes[off+2] << 16)) 

    if l >= 2: 
     h = (h^(data_as_bytes[off+1] << 8)) 

    if l >= 1: 
     h = (h^data_as_bytes[off]) 
     h = (h * m) & MASK 

    h = h^((h >> r) & MASK) 
    h = (h * m) & MASK 
    h = h^((h >> r) & MASK) 

    return h