hash algorithm for variable size boolean array [cl

Posted 2019-08-27 19:05

I have some boolean arrays whose sizes are not constant, and I need a strong, fast hash algorithm that minimizes the chance of hash collisions between them.

My own idea was to calculate the integer value of each boolean array, but then, for example, these two arrays both hash to 3:
[0, 1, 1] and [1, 1]

I thought of multiplying the integer value by the size of the array, but that idea is no good either, because there is still a high chance of hash collisions.
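To make the two collisions concrete, here is a minimal C++ sketch of both ideas from the question. The function names `naive_value` and `value_times_size` are made up for illustration, and reading the array most-significant-bit first is an assumption based on the example above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive idea: read the array as a binary number, first element = most
// significant bit. Leading zeros vanish, so the array length is lost.
std::uint64_t naive_value(const std::vector<bool>& bits) {
    std::uint64_t v = 0;
    for (bool b : bits)
        v = (v << 1) | (b ? 1u : 0u);
    return v;
}

// Second idea: multiply the integer value by the array length.
// Different (value, length) pairs can still give the same product.
std::uint64_t value_times_size(const std::vector<bool>& bits) {
    return naive_value(bits) * bits.size();
}
```

With these definitions, `naive_value` returns 3 for both [0, 1, 1] and [1, 1], and `value_times_size` returns 6 for both [1, 1] (3 * 2) and [0, 1, 0] (2 * 3). Every all-zero array also maps to 0 under both functions.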

Does anyone have a good idea?

3 answers
三岁会撩人
Post #2 · 2019-08-27 19:12

My ideas:

Approach #1:

  1. Calculate the first 2n prime numbers, where n is the length of the array.

  2. Let hash = 1.

  3. For i = 0 to n - 1: if the bit at position i is 1, multiply hash by both the (2i)-th and the (2i + 1)-th prime (counting from 0). If it's 0, multiply it by the (2i)-th prime only.

Approach #2:

  1. Treat the binary arrays as ternary. Bit is 0 => ternary digit is 0; bit is 1 => ternary digit is 1; bit is not present => ternary digit is 2 (this works because the array has a maximal possible length).

  2. Calculate the ternary number using this substitution - the result will be unique.


Here's some C++ code implementing both algorithms, plus a test program that generates the hash of every boolean array of length 0...18. I use the C++11 class std::unordered_set so that each distinct hash is stored only once. Thus, if there are no duplicates (i.e. if the hash function is perfect), we should get 2 ^ 19 - 1 elements in the set, which we do (I had to change the integers to unsigned long long on IDEone, else the hashes weren't perfect; I suspect this has to do with 32- vs. 64-bit architectures):

#include <cstddef>
#include <iostream>
#include <unordered_set>

#define MAX_LEN 18

/* Approach #1: the (2i)-th prime is always a factor, the (2i+1)-th only if
   bit i is set. For longer arrays the product no longer fits in the integer
   type and wraps around (unsigned overflow). */
unsigned long long prime_hash(const unsigned int *arr, size_t len)
{
    /* first 2 * MAX_LEN primes */
    static const unsigned long long p[2 * MAX_LEN] = {
          2,   3,   5,   7,  11,  13,  17,  19,  23,
         29,  31,  37,  41,  43,  47,  53,  59,  61,
         67,  71,  73,  79,  83,  89,  97, 101, 103,
        107, 109, 113, 127, 131, 137, 139, 149, 151
    };

    unsigned long long h = 1;
    for (size_t i = 0; i < len; i++)
        h *= p[2 * i] * (arr[i] ? p[2 * i + 1] : 1);

    return h;
}

/* Approach #2: ternary digit 0 or 1 for each bit that is present,
   digit 2 for every position past the end of the array. */
unsigned long long ternary_hash(const unsigned int *arr, size_t len)
{
    /* powers of 3: 3^0 ... 3^(MAX_LEN - 1) */
    static const unsigned long long p3[MAX_LEN] = {
               1,            3,            9,           27,
              81,          243,          729,         2187,
            6561,        19683,        59049,       177147,
          531441,      1594323,      4782969,     14348907,
        43046721,    129140163
    };

    unsigned long long h = 0;
    for (size_t i = 0; i < len; i++)
        if (arr[i])
            h += p3[i];

    for (size_t i = len; i < MAX_LEN; i++)
        h += 2 * p3[i];

    return h;
}

/* write the lowest len bits of n into dst, least significant bit first */
void int2barr(unsigned int *dst, unsigned long long n, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = n & 1;
        n >>= 1;
    }
}

int main()
{
    std::unordered_set<unsigned long long> phashes, thashes;

    /* generate all possible bool-arrays from length 0 to length 18 */

    /* first, we hash the only 0-element array */
    phashes.insert(prime_hash(nullptr, 0));
    thashes.insert(ternary_hash(nullptr, 0));

    /* then we hash the arrays of length 1...18;
       fixed-size buffer, since variable-length arrays are not standard C++ */
    unsigned int bits[MAX_LEN];
    for (size_t len = 1; len <= MAX_LEN; len++) {
        for (unsigned long long i = 0; i < (1ULL << len); i++) {
            int2barr(bits, i, len);

            phashes.insert(prime_hash(bits, len));
            thashes.insert(ternary_hash(bits, len));
        }
    }

    std::cout << "prime hashes: " << phashes.size() << std::endl;
    std::cout << "ternary hashes: " << thashes.size() << std::endl;

    return 0;
}
我命由我不由天
Post #3 · 2019-08-27 19:23

A simple and efficient hash code: replace 0 and 1 with prime numbers and run the usual shift-accumulate loop:

hash = 0
for bit in bits:
    hash = hash*31 + 2*bit + 3
return hash

Here 0 is treated as 3 and 1 is treated as 5, so that leading zeros are not ignored. The multiplication by 31 makes sure that order matters. This isn't cryptographically strong though: given a hash code for a short sequence it's simple arithmetic to reverse it.
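As a rough sketch, the loop above might look like this in C++. The name `poly_hash` and the choice of a 32-bit unsigned accumulator (which simply wraps on overflow) are assumptions for illustration, not part of the answer.

```cpp
#include <cstdint>
#include <vector>

// Multiply-by-31 accumulator: a 0 bit contributes 3, a 1 bit contributes 5.
// Unsigned arithmetic wraps modulo 2^32, which is well defined in C++.
std::uint32_t poly_hash(const std::vector<bool>& bits) {
    std::uint32_t h = 0;
    for (bool b : bits)
        h = h * 31u + (b ? 5u : 3u);
    return h;
}
```

For the two arrays from the question, [0, 1, 1] gives (3 * 31 + 5) * 31 + 5 = 3043 and [1, 1] gives 5 * 31 + 5 = 160, so the leading zero is no longer ignored. Because the result wraps at 32 bits, collisions are still possible for longer arrays.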

一夜七次
Post #4 · 2019-08-27 19:24

Insert a sentinel true element at the start of the array, then interpret the array as a binary number. This is a perfect hash (no collisions) for arrays with fewer than 32 elements. For larger arrays I suggest doing the arithmetic modulo a large prime less than 2^31.

Examples:

Array       | Binary | Decimal
------------+--------+---------
[ 0, 1, 1 ] |  01011 |      11
[ 1, 1 ]    |  00111 |       7

This is the same as interpreting the array as a binary number and taking the bitwise OR with 1 << n, where n is the size of the array.

Implementation:

int hash(int[] array)
{
    int h = (1 << array.length);
    for (int i = 0; i < array.length; i++)
    {
        h = h | (array[i] << (array.length - i - 1));
    }
    return h;
}
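The answer suggests reducing modulo a large prime for longer arrays but shows no code for it. The following C++ sketch is one way that could look. The function name `sentinel_hash_mod` and the specific modulus 2^31 - 1 = 2147483647 (a Mersenne prime) are assumptions chosen for illustration.

```cpp
#include <cstdint>
#include <vector>

// Assumed modulus: 2^31 - 1, a prime below 2^31.
constexpr std::uint64_t kModulus = 2147483647ULL;

// Start from the sentinel bit 1, then append each array bit as the next
// binary digit, reducing modulo kModulus after every step. The 64-bit
// accumulator keeps h * 2 + 1 from overflowing before the reduction.
std::uint32_t sentinel_hash_mod(const std::vector<bool>& bits) {
    std::uint64_t h = 1;  // the sentinel "true" element
    for (bool b : bits)
        h = (h * 2 + (b ? 1 : 0)) % kModulus;
    return static_cast<std::uint32_t>(h);
}
```

For short arrays this reproduces the values in the table above (11 for [0, 1, 1] and 7 for [1, 1]). Once the arrays are long enough for the reduction to kick in, the hash is no longer perfect and collisions become possible.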