【密码学】一文读懂MurMurHash2

2022-06-12 2297

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 上次我们聊过了一代的MurMurHash算法，是的，我又来水文章了，今天呢，接着来聊一下二代的MurMurHash算法，二代算法的整体结构实际上和一代算法差不太多，只是对于每一轮数据的处理过程当中的运算有一些差异，算法的来源依然是来自于Google官网给提供的源码，对着源码看的结构，对于这个算法呢，有两个版本，一个是32位的，一个是64位的，对于32位的算法和64位的算法，区别在于两个初始的魔数不同，整体运算过程还是十分相似的。

【密码学】一文读懂MurMurHash2

YXDGY[6I7JLU}U7U~W[L)]A.jpg

上次我们聊过了一代的MurMurHash算法，是的，我又来水文章了，今天呢，接着来聊一下二代的MurMurHash算法，二代算法的整体结构实际上和一代算法差不太多，只是对于每一轮数据的处理过程当中的运算有一些差异，算法的来源依然是来自于Google官网给提供的源码，对着源码看的结构，对于这个算法呢，有两个版本，一个是32位的，一个是64位的，对于32位的算法和64位的算法，区别在于两个初始的魔数不同，整体运算过程还是十分相似的。

原始代码

先贴一下原汁原味的代码。

32位哈希算法

#include <iostream>
//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby
// Note - This code makes a few assumptions about how your machine behaves -
// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4
// And it has a few limitations -
// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.
// code from: https://sites.google.com/site/murmurhash/
unsigned int MurmurHash2(const void *key, int len, unsigned int seed) {
    // 'm' and 'r' are mixing constants generated offline.
    // They're not really 'magic', they just happen to work well.
    const unsigned int m = 0x5bd1e995;
    const int r = 24;
    // Initialize the hash to a 'random' value
    unsigned int h = seed ^ len;
    // Mix 4 bytes at a time into the hash
    const unsigned char *data = (const unsigned char *) key;
    while (len >= 4) {
        unsigned int k = *(unsigned int *) data;
        k *= m;
        k ^= k >> r;
        k *= m;
        h *= m;
        h ^= k;
        data += 4;
        len -= 4;
    }
    // Handle the last few bytes of the input array
    switch (len) {
        case 3:
            h ^= data[2] << 16;
        case 2:
            h ^= data[1] << 8;
        case 1:
            h ^= data[0];
            h *= m;
    };
    // Do a few final mixes of the hash to ensure the last few
    // bytes are well-incorporated.
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}

64位哈希算法

//-----------------------------------------------------------------------------
// MurmurHash2, 64-bit versions, by Austin Appleby
// The same caveats as 32-bit MurmurHash2 apply here - beware of alignment
// and endian-ness issues if used across multiple platforms.
// 64-bit hash for 64-bit platforms
// code from: https://sites.google.com/site/murmurhash/
#include <iostream>
uint64_t MurmurHash64A(const void *key, int len, unsigned int seed) {
    const uint64_t m = 0xc6a4a7935bd1e995;
    const int r = 47;
    uint64_t h = seed ^ (len * m);
    const uint64_t *data = (const uint64_t *) key;
    const uint64_t *end = data + (len / 8);
    while (data != end) {
        uint64_t k = *data++;
        k *= m;
        k ^= k >> r;
        k *= m;
        h ^= k;
        h *= m;
    }
    const unsigned char *data2 = (const unsigned char *) data;
    switch (len & 7) {
        case 7:
            h ^= uint64_t(data2[6]) << 48;
        case 6:
            h ^= uint64_t(data2[5]) << 40;
        case 5:
            h ^= uint64_t(data2[4]) << 32;
        case 4:
            h ^= uint64_t(data2[3]) << 24;
        case 3:
            h ^= uint64_t(data2[2]) << 16;
        case 2:
            h ^= uint64_t(data2[1]) << 8;
        case 1:
            h ^= uint64_t(data2[0]);
            h *= m;
    };
    h ^= h >> r;
    h *= m;
    h ^= h >> r;
    return h;
}
// 64-bit hash for 32-bit platforms
uint64_t MurmurHash64B(const void *key, int len, unsigned int seed) {
    const unsigned int m = 0x5bd1e995;
    const int r = 24;
    unsigned int h1 = seed ^ len;
    unsigned int h2 = 0;
    const unsigned int *data = (const unsigned int *) key;
    while (len >= 8) {
        unsigned int k1 = *data++;
        k1 *= m;
        k1 ^= k1 >> r;
        k1 *= m;
        h1 *= m;
        h1 ^= k1;
        len -= 4;
        unsigned int k2 = *data++;
        k2 *= m;
        k2 ^= k2 >> r;
        k2 *= m;
        h2 *= m;
        h2 ^= k2;
        len -= 4;
    }
    if (len >= 4) {
        unsigned int k1 = *data++;
        k1 *= m;
        k1 ^= k1 >> r;
        k1 *= m;
        h1 *= m;
        h1 ^= k1;
        len -= 4;
    }
    switch (len) {
        case 3:
            h2 ^= ((unsigned char *) data)[2] << 16;
        case 2:
            h2 ^= ((unsigned char *) data)[1] << 8;
        case 1:
            h2 ^= ((unsigned char *) data)[0];
            h2 *= m;
    }
    h1 ^= h2 >> 18;
    h1 *= m;
    h2 ^= h1 >> 22;
    h2 *= m;
    h1 ^= h2 >> 17;
    h1 *= m;
    h2 ^= h1 >> 19;
    h2 *= m;
    uint64_t h = h1;
    h = (h << 32) | h2;
    return h;
}

如果读者看过前面讲过的一代的算法，那么对于二代算法来说，理解起来应该不算太难。

算法结构

这里，只画一个32位算法的图了，任性的偷懒一把。

$9{(_6_L9O2CJI3$N0}PSNPT.png$ MurMurHash

通过对比可以发现，这个实际上和一代的算法的整体结构是几乎一样的，只是哈希函数，和操作略有不同。

代码实现

Rust

还是用熟悉的语言，先用rust写一下吧。

use std::convert::TryInto;
fn murmurhash2(key: &[u8], seed: u32) -> u32 {
    let m = 0x5bd1e995;
    let r = 24;
    let len = key.len() as u32;
    let mut h = seed ^ len;
    let remain = len % 4;
    let mut offset = len;
    if remain > 0 {
        offset = len - remain;
    }
    let data = key[..(offset as usize)].chunks(4).map(|it| u32::from_le_bytes(it.try_into().unwrap())).collect::<Vec<u32>>();
    for k in data {
        let mut k = k;
        k = k.wrapping_mul(m);
        k ^= k >> r;
        k = k.wrapping_mul(m);
        h = h.wrapping_mul(m);
        h ^= k;
    }
    let shift_table = [0, 8, 16];
    while offset < len {
        h ^= (key[offset as usize] as u32).wrapping_shl(shift_table[(len - offset - 1) as usize]);
        offset += 1;
        if (len - offset) == 0 {
            h = h.wrapping_mul(m);
        }
    }
    // Do a few final mixes of the hash to ensure the last few
    // bytes are well-incorporated.
    h ^= h >> 13;
    h = h.wrapping_mul(m);
    h ^= h >> 15;
    h
}
fn murmurhash2_64a(key: &[u8], seed: u64) -> u64 {
    let m = 0xc6a4a7935bd1e995u64;
    let r = 47;
    let len = key.len() as u64;
    let mut h = seed ^ (len.wrapping_mul(m));
    let mut remain = len % 8;
    let mut offset = len;
    if remain > 0 {
        offset = len - remain;
    }
    let data = key[..(offset as usize)].chunks(8).map(|it| u64::from_le_bytes(it.try_into().unwrap())).collect::<Vec<u64>>();
    for k in data {
        let mut k = k;
        k = k.wrapping_mul(m);
        k ^= k >> r;
        k = k.wrapping_mul(m);
        h ^= k;
        h = h.wrapping_mul(m);
    }
    let shift_table = [0, 8, 16, 24, 32, 40, 48];
    while remain > 0 {
        h ^= (key[(len - (7 - remain)) as usize] as u64).wrapping_shl(shift_table[(remain - 1) as usize]);
        remain -= 1;
        if remain == 0 {
            h = h.wrapping_mul(m);
        }
    }
    // Do a few final mixes of the hash to ensure the last few
    // bytes are well-incorporated.
    h ^= h >> r;
    h = h.wrapping_mul(m);
    h ^= h >> r;
    h
}

Go

这里打算学一下go，然后这也用go来写一下，咋写咋感觉写的一股c味。

func murmurhash2(key []byte, seed uint32) uint32 {
  m := uint32(0x5bd1e995)
  r := 24
  keyLen := uint32(len(key))
  h := seed ^ keyLen
  offset := 0
  e := binary.LittleEndian
  for keyLen >= 4 {
    k := e.Uint32(key[offset : offset+4])
    k *= m
    k ^= k >> r
    k *= m
    h *= m
    h ^= k
    offset += 4
    keyLen -= 4
  }
  switch keyLen {
  case 3:
    h ^= uint32(key[offset+2]) << 16
    fallthrough
  case 2:
    h ^= uint32(key[offset+1]) << 8
    fallthrough
  case 1:
    h ^= uint32(key[offset])
    h *= m
  }
  h ^= h >> 13
  h *= m
  h ^= h >> 15
  return h
}
func murmurhash264a(key []byte, seed uint32) uint64 {
  m := uint64(0xc6a4a7935bd1e995)
  r := 47
  keyLen := uint64(len(key))
  h := uint64(seed) ^ (keyLen * m)
  offset := 0
  e := binary.LittleEndian
  for keyLen >= 8 {
    k := e.Uint64(key[offset : offset+8])
    fmt.Println(k)
    k *= m
    k ^= k >> r
    k *= m
    h ^= k
    h *= m
    offset += 8
    keyLen -= 8
  }
  switch keyLen {
  case 7:
    h ^= uint64(key[offset+6]) << 48
    fallthrough
  case 6:
    h ^= uint64(key[offset+5]) << 40
    fallthrough
  case 5:
    h ^= uint64(key[offset+4]) << 32
    fallthrough
  case 4:
    h ^= uint64(key[offset+3]) << 24
    fallthrough
  case 3:
    h ^= uint64(key[offset+2]) << 16
    fallthrough
  case 2:
    h ^= uint64(key[offset+1]) << 8
    fallthrough
  case 1:
    h ^= uint64(key[offset])
    h *= m
  }
  h ^= h >> r
  h *= m
  h ^= h >> r
  return h
}

小结

对于非密码学安全的哈希算法来说，这个实际上是比密码学安全的哈希过程要简单不少的，因为不需要保证一些安全特性，比如今天所讲解的MurMurHash对数据的处理只有一轮，处理完就结束了。

【密码学】一文读懂MurMurHash2

【密码学】一文读懂MurMurHash2

原始代码

32位哈希算法

64位哈希算法

算法结构

代码实现

Rust

Go

小结

热门文章

最新文章

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

【密码学】一文读懂MurMurHash2

【密码学】一文读懂MurMurHash2

原始代码

32位哈希算法

64位哈希算法

算法结构

代码实现

Rust

Go

小结

热门文章

最新文章

相关电子书