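[Cryptography] Understanding MurmurHash2 in One Article
Last time we covered the first-generation MurmurHash algorithm. Yes, I'm back to pad out another post: today we continue with the second generation, MurmurHash2. Its overall structure is almost the same as the first generation; only the operations applied to each block of data differ. The source is again the code published on the author's Google Sites page (https://sites.google.com/site/murmurhash/), and I worked out the structure by reading that code. The algorithm comes in two variants, 32-bit and 64-bit. They differ mainly in the two mixing constants (the multiplier m and the shift r); the overall computation is very similar.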
Original code
Here is the original code first.
32-bit hash
    //-----------------------------------------------------------------------------
    // MurmurHash2, by Austin Appleby

    // Note - This code makes a few assumptions about how your machine behaves -

    // 1. We can read a 4-byte value from any address without crashing
    // 2. sizeof(int) == 4

    // And it has a few limitations -

    // 1. It will not work incrementally.
    // 2. It will not produce the same results on little-endian and big-endian
    //    machines.
    // code from: https://sites.google.com/site/murmurhash/

    unsigned int MurmurHash2(const void *key, int len, unsigned int seed) {
        // 'm' and 'r' are mixing constants generated offline.
        // They're not really 'magic', they just happen to work well.
        const unsigned int m = 0x5bd1e995;
        const int r = 24;

        // Initialize the hash to a 'random' value
        unsigned int h = seed ^ len;

        // Mix 4 bytes at a time into the hash
        const unsigned char *data = (const unsigned char *) key;

        while (len >= 4) {
            unsigned int k = *(unsigned int *) data;

            k *= m;
            k ^= k >> r;
            k *= m;

            h *= m;
            h ^= k;

            data += 4;
            len -= 4;
        }

        // Handle the last few bytes of the input array
        // (the cases fall through on purpose)
        switch (len) {
            case 3: h ^= data[2] << 16;
            case 2: h ^= data[1] << 8;
            case 1: h ^= data[0];
                    h *= m;
        };

        // Do a few final mixes of the hash to ensure the last few
        // bytes are well-incorporated.
        h ^= h >> 13;
        h *= m;
        h ^= h >> 15;

        return h;
    }
64-bit hash
    //-----------------------------------------------------------------------------
    // MurmurHash2, 64-bit versions, by Austin Appleby

    // The same caveats as 32-bit MurmurHash2 apply here - beware of alignment
    // and endian-ness issues if used across multiple platforms.
    // code from: https://sites.google.com/site/murmurhash/

    #include <cstdint>  // uint64_t

    // 64-bit hash for 64-bit platforms
    uint64_t MurmurHash64A(const void *key, int len, unsigned int seed) {
        const uint64_t m = 0xc6a4a7935bd1e995;
        const int r = 47;

        uint64_t h = seed ^ (len * m);

        const uint64_t *data = (const uint64_t *) key;
        const uint64_t *end = data + (len / 8);

        while (data != end) {
            uint64_t k = *data++;

            k *= m;
            k ^= k >> r;
            k *= m;

            h ^= k;
            h *= m;
        }

        const unsigned char *data2 = (const unsigned char *) data;

        // The cases fall through on purpose.
        switch (len & 7) {
            case 7: h ^= uint64_t(data2[6]) << 48;
            case 6: h ^= uint64_t(data2[5]) << 40;
            case 5: h ^= uint64_t(data2[4]) << 32;
            case 4: h ^= uint64_t(data2[3]) << 24;
            case 3: h ^= uint64_t(data2[2]) << 16;
            case 2: h ^= uint64_t(data2[1]) << 8;
            case 1: h ^= uint64_t(data2[0]);
                    h *= m;
        };

        h ^= h >> r;
        h *= m;
        h ^= h >> r;

        return h;
    }

    // 64-bit hash for 32-bit platforms
    uint64_t MurmurHash64B(const void *key, int len, unsigned int seed) {
        const unsigned int m = 0x5bd1e995;
        const int r = 24;

        unsigned int h1 = seed ^ len;
        unsigned int h2 = 0;

        const unsigned int *data = (const unsigned int *) key;

        while (len >= 8) {
            unsigned int k1 = *data++;
            k1 *= m; k1 ^= k1 >> r; k1 *= m;
            h1 *= m; h1 ^= k1;
            len -= 4;

            unsigned int k2 = *data++;
            k2 *= m; k2 ^= k2 >> r; k2 *= m;
            h2 *= m; h2 ^= k2;
            len -= 4;
        }

        if (len >= 4) {
            unsigned int k1 = *data++;
            k1 *= m; k1 ^= k1 >> r; k1 *= m;
            h1 *= m; h1 ^= k1;
            len -= 4;
        }

        // The cases fall through on purpose.
        switch (len) {
            case 3: h2 ^= ((unsigned char *) data)[2] << 16;
            case 2: h2 ^= ((unsigned char *) data)[1] << 8;
            case 1: h2 ^= ((unsigned char *) data)[0];
                    h2 *= m;
        }

        h1 ^= h2 >> 18; h1 *= m;
        h2 ^= h1 >> 22; h2 *= m;
        h1 ^= h2 >> 17; h1 *= m;
        h2 ^= h1 >> 19; h2 *= m;

        uint64_t h = h1;
        h = (h << 32) | h2;

        return h;
    }
If you have read the earlier post on the first-generation algorithm, the second generation should not be hard to follow.
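The part of the reference code that is easiest to get wrong when porting is the fall-through `switch` for the trailing bytes. As a minimal sketch (the helper name `pack_tail` is my own, not part of the original code), the 32-bit tail handling is equivalent to packing the 1 to 3 leftover bytes into a little-endian word and XOR-ing that word into `h`:

```rust
/// Packs the trailing bytes the way the reference `switch` does:
/// tail[0] lands in the lowest byte, tail[1] in the next one, and so on.
/// `pack_tail` is a hypothetical helper name used only for illustration.
fn pack_tail(tail: &[u8]) -> u32 {
    assert!(tail.len() < 4, "the 32-bit tail is at most 3 bytes long");
    let mut word = 0u32;
    for (i, &byte) in tail.iter().enumerate() {
        // Byte i is shifted by 8 * i bits, matching data[i] << (8 * i) in C.
        word |= u32::from(byte) << (8 * i);
    }
    word
}

fn main() {
    // Three leftover bytes: data[2] << 16 | data[1] << 8 | data[0].
    println!("{:#010x}", pack_tail(&[0x01, 0x02, 0x03]));
}
```

Because the three bytes occupy disjoint bit ranges, XOR-ing them into `h` one by one (as the C code does) gives the same result as XOR-ing the packed word once. Remember that the following `h *= m` only runs when there is at least one trailing byte.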
Algorithm structure
I'll only draw the diagram for the 32-bit algorithm here and allow myself to be lazy about the rest.
[Figure: structure of the 32-bit MurmurHash2 algorithm]
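Comparing the two shows that the overall structure is almost identical to the first generation; only the per-block mixing function and a few operations differ slightly.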
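To make the per-block step concrete, here is a small sketch of one round of the 32-bit main loop, pulled out into its own function. The function name `mix_block` and the sample input are assumptions of mine for illustration; the constants and the order of operations follow the 32-bit reference code.

```rust
const M: u32 = 0x5bd1e995; // multiplier from the 32-bit reference code
const R: u32 = 24; // shift from the 32-bit reference code

/// One round of the 32-bit main loop: scramble the block `k`,
/// then fold it into the running hash `h`.
/// `mix_block` is a hypothetical name used only for illustration.
fn mix_block(h: u32, mut k: u32) -> u32 {
    k = k.wrapping_mul(M);
    k ^= k >> R;
    k = k.wrapping_mul(M);
    h.wrapping_mul(M) ^ k
}

fn main() {
    let seed = 0u32;
    let data = b"abcdefgh"; // exactly two 4-byte blocks, no tail
    let mut h = seed ^ (data.len() as u32);
    for chunk in data.chunks_exact(4) {
        let k = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        h = mix_block(h, k);
    }
    // This is the state before the tail handling and the final mixes.
    println!("state after the main loop: {:#010x}", h);
}
```

Note that the 64-bit variant `MurmurHash64A` folds the block in the opposite order (`h ^= k` first, then `h *= m`), so this sketch covers the 32-bit round only.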
Implementation
Rust
Let's start with a familiar language and write it in Rust.
    use std::convert::TryInto;

    fn murmurhash2(key: &[u8], seed: u32) -> u32 {
        const M: u32 = 0x5bd1e995;
        const R: u32 = 24;

        let mut h = seed ^ (key.len() as u32);

        // Mix 4 bytes at a time into the hash, reading each block as little-endian.
        let mut chunks = key.chunks_exact(4);
        for chunk in &mut chunks {
            let mut k = u32::from_le_bytes(chunk.try_into().unwrap());
            k = k.wrapping_mul(M);
            k ^= k >> R;
            k = k.wrapping_mul(M);
            h = h.wrapping_mul(M);
            h ^= k;
        }

        // Handle the last 1-3 bytes: byte i is shifted left by 8 * i bits,
        // exactly like the fall-through switch in the reference code.
        let tail = chunks.remainder();
        if !tail.is_empty() {
            for (i, &b) in tail.iter().enumerate() {
                h ^= (b as u32) << (8 * i);
            }
            h = h.wrapping_mul(M);
        }

        // Do a few final mixes of the hash to ensure the last few
        // bytes are well-incorporated.
        h ^= h >> 13;
        h = h.wrapping_mul(M);
        h ^= h >> 15;
        h
    }

    // Note: the reference MurmurHash64A takes a 32-bit seed; a u64 seed is a superset.
    fn murmurhash2_64a(key: &[u8], seed: u64) -> u64 {
        const M: u64 = 0xc6a4a7935bd1e995;
        const R: u32 = 47;

        let mut h = seed ^ (key.len() as u64).wrapping_mul(M);

        // Mix 8 bytes at a time into the hash.
        let mut chunks = key.chunks_exact(8);
        for chunk in &mut chunks {
            let mut k = u64::from_le_bytes(chunk.try_into().unwrap());
            k = k.wrapping_mul(M);
            k ^= k >> R;
            k = k.wrapping_mul(M);
            h ^= k;
            h = h.wrapping_mul(M);
        }

        // Handle the last 1-7 bytes: byte i is shifted left by 8 * i bits.
        let tail = chunks.remainder();
        if !tail.is_empty() {
            for (i, &b) in tail.iter().enumerate() {
                h ^= (b as u64) << (8 * i);
            }
            h = h.wrapping_mul(M);
        }

        // Final mixes.
        h ^= h >> R;
        h = h.wrapping_mul(M);
        h ^= h >> R;
        h
    }
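The reference code warns that it will not produce the same results on little-endian and big-endian machines, because it reads each block with a raw pointer cast. The port above avoids that by decoding every block explicitly as little-endian. A small sketch of the difference (the helper names `read_block_le` and `read_block_native` are mine, chosen for illustration):

```rust
use std::convert::TryInto;

/// Reads a 4-byte block as little-endian regardless of the host CPU,
/// so the hash is the same on every platform.
fn read_block_le(chunk: &[u8]) -> u32 {
    u32::from_le_bytes(chunk.try_into().expect("block must be exactly 4 bytes"))
}

/// Reads a 4-byte block in host byte order. In terms of byte order this
/// mirrors `*(unsigned int *) data` in the C code.
fn read_block_native(chunk: &[u8]) -> u32 {
    u32::from_ne_bytes(chunk.try_into().expect("block must be exactly 4 bytes"))
}

fn main() {
    let block = [0x01u8, 0x02, 0x03, 0x04];
    println!("little-endian read: {:#010x}", read_block_le(&block));
    println!("native read:        {:#010x}", read_block_native(&block));
}
```

On a little-endian host both reads agree; on a big-endian host only the explicit little-endian read matches what the reference code produces on x86.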
Go
I'm planning to learn Go, so here is a Go version as well. However I write it, it still smells like C.
    package murmurhash // package name chosen for illustration

    import "encoding/binary"

    func murmurhash2(key []byte, seed uint32) uint32 {
    	const m uint32 = 0x5bd1e995
    	const r = 24

    	keyLen := uint32(len(key))
    	h := seed ^ keyLen
    	offset := 0
    	e := binary.LittleEndian

    	// Mix 4 bytes at a time into the hash.
    	for keyLen >= 4 {
    		k := e.Uint32(key[offset : offset+4])
    		k *= m
    		k ^= k >> r
    		k *= m
    		h *= m
    		h ^= k
    		offset += 4
    		keyLen -= 4
    	}

    	// Handle the last 1-3 bytes.
    	switch keyLen {
    	case 3:
    		h ^= uint32(key[offset+2]) << 16
    		fallthrough
    	case 2:
    		h ^= uint32(key[offset+1]) << 8
    		fallthrough
    	case 1:
    		h ^= uint32(key[offset])
    		h *= m
    	}

    	// Final mixes.
    	h ^= h >> 13
    	h *= m
    	h ^= h >> 15
    	return h
    }

    func murmurhash264a(key []byte, seed uint32) uint64 {
    	const m uint64 = 0xc6a4a7935bd1e995
    	const r = 47

    	keyLen := uint64(len(key))
    	h := uint64(seed) ^ (keyLen * m)
    	offset := 0
    	e := binary.LittleEndian

    	// Mix 8 bytes at a time into the hash.
    	for keyLen >= 8 {
    		k := e.Uint64(key[offset : offset+8])
    		k *= m
    		k ^= k >> r
    		k *= m
    		h ^= k
    		h *= m
    		offset += 8
    		keyLen -= 8
    	}

    	// Handle the last 1-7 bytes.
    	switch keyLen {
    	case 7:
    		h ^= uint64(key[offset+6]) << 48
    		fallthrough
    	case 6:
    		h ^= uint64(key[offset+5]) << 40
    		fallthrough
    	case 5:
    		h ^= uint64(key[offset+4]) << 32
    		fallthrough
    	case 4:
    		h ^= uint64(key[offset+3]) << 24
    		fallthrough
    	case 3:
    		h ^= uint64(key[offset+2]) << 16
    		fallthrough
    	case 2:
    		h ^= uint64(key[offset+1]) << 8
    		fallthrough
    	case 1:
    		h ^= uint64(key[offset])
    		h *= m
    	}

    	// Final mixes.
    	h ^= h >> r
    	h *= m
    	h ^= h >> r
    	return h
    }
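Summary
Non-cryptographic hash algorithms are considerably simpler than cryptographically secure ones, because they do not have to guarantee security properties. MurmurHash2, discussed today, is a good example: it makes a single pass over the data and is done.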
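One way to see that security is not a design goal: the scrambling applied to each 32-bit block is a plain bijection that can be undone in a few lines. The following is an illustrative sketch, not part of the original code; the names `scramble`, `unscramble`, and `inverse_mod_2_32` are my own. It relies on two facts: an odd multiplier has an inverse modulo 2^32, and `x ^ (x >> 24)` is its own inverse on 32-bit values.

```rust
const M: u32 = 0x5bd1e995; // multiplier from the 32-bit reference code
const R: u32 = 24; // shift from the 32-bit reference code

/// The scrambling applied to each 4-byte block in the 32-bit main loop.
fn scramble(mut k: u32) -> u32 {
    k = k.wrapping_mul(M);
    k ^= k >> R;
    k.wrapping_mul(M)
}

/// Multiplicative inverse of an odd number modulo 2^32 (Newton's iteration).
fn inverse_mod_2_32(a: u32) -> u32 {
    let mut x = a; // a * a == 1 (mod 8) for any odd a, so 3 low bits are correct
    for _ in 0..5 {
        // Each step doubles the number of correct low bits.
        x = x.wrapping_mul(2u32.wrapping_sub(a.wrapping_mul(x)));
    }
    x
}

/// Undoes `scramble` step by step, in reverse order.
fn unscramble(mut k: u32) -> u32 {
    let m_inv = inverse_mod_2_32(M);
    k = k.wrapping_mul(m_inv);
    k ^= k >> R; // applying the xor-shift twice cancels it, since 2 * 24 >= 32
    k.wrapping_mul(m_inv)
}

fn main() {
    let k = 0xdead_beefu32;
    let scrambled = scramble(k);
    println!("{:#010x} -> {:#010x} -> {:#010x}", k, scrambled, unscramble(scrambled));
}
```

A cryptographic hash is designed so that this kind of step-by-step reversal is infeasible; MurmurHash2 only aims for good bit mixing at low cost.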