Measuring a hash function's quality (for use with a map)

Posted 2019-07-23 12:22

Question:

I'm looking into an associative array library in C (which I didn't write), similar to a map in C++ or a dict in Python.

It uses some non-standard hashing functions which I'm not certain are any good (maybe the original developer just threw in some magic numbers and XOR operators and hoped for the best).

I wrote a test which, given some sample input, measures how evenly the hashing function distributes items into a fixed number of buckets (hash modulo the array size in this case).

This way, given enough input, there is at least some way to gauge how well a hash function performs.
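
Roughly, the idea looks like this (a minimal sketch of such a test, not the actual library or test code; bucket_histogram, hash_func and the key list are all placeholders):

def bucket_histogram(keys, hash_func, num_buckets):
    # Count how many keys land in each bucket when hashed modulo the table size.
    counts = [0] * num_buckets
    for key in keys:
        counts[hash_func(key) % num_buckets] += 1
    return counts

# e.g. bucket_histogram(words, some_hash, 1024); an even spread of counts is good,
# one huge bucket is the worst case.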

It seems like it must be a common issue for anyone writing an associative array.


Is there some convention for measuring how well a hash function performs, in terms of distribution quality rather than speed?

The worst case would be every input mapping to the same result; the best would be an even distribution (or as close to it as possible).

Note that I'm not looking for cryptographic strength here.

Answer 1:

There is a formula for this from the Dragon Book.
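
In the notation of the code further down (which implements it), that formula is:

    ( SUM over buckets of  b_j * (b_j + 1) / 2 )  /  ( (n / 2m) * (n + 2m - 1) )

where b_j is the number of items in bucket j, n is the total number of items, and m is the number of buckets. For a hash that assigns keys to buckets uniformly at random, the denominator is exactly the expected value of the numerator, so a good hash scores close to 1.0.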

I personally have a rule of thumb (assuming linear chaining): insert N items into N slots/chains, and compute the total number of accesses (first element in a chain := 1 access, 2nd := 2 accesses, etc.) needed to fetch all N elements once. This equals SUM( chainlen * (chainlen + 1) / 2 ), summed over all chains.

Given random input data, any reasonable hash function should score about 1.5 * N on this metric, or a bit below that.
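
A minimal sketch of that check in Python (my own illustration, not the answerer's code; randrange stands in for "a hash that behaves like random assignment"):

from random import randrange

def total_accesses(num_items, num_buckets):
    # Simulate num_items insertions, then total the accesses needed to look each
    # item up once: 1 for the first element of a chain, 2 for the second, ...
    chains = [0] * num_buckets
    for _ in range(num_items):
        chains[randrange(num_buckets)] += 1
    return sum(length * (length + 1) // 2 for length in chains)

N = 1000000
print(total_accesses(N, N) / N)   # prints roughly 1.5 for random assignment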


Example of a typical run using a list of 2543846 unique tokens/words (and their statistics) hashed into exactly 2543846 slots/buckets:

plasser@pisbak:~/src/hash$ ./diskhash woorden.txt woorden.hsh
Ptr = 0x7fb5c264f000, Sz = 37362821
Array= 0x7fb5bff7e000 Cnt = 2543846
__________________
Histogram of seek lenghts:
len:    Count     Hops   Fraction (Cumulative)
  1:  1606429  1606429 0.63149617 (0.63149617)
  2:   672469  1344938 0.26435130 (0.89584747)
  3:   205046   615138 0.08060472 (0.97645219)
  4:    48604   194416 0.01910650 (0.99555869)
  5:     9477    47385 0.00372546 (0.99928415)
  6:     1581     9486 0.00062150 (0.99990565)
  7:      215     1505 0.00008452 (0.99999017)
  8:       24      192 0.00000943 (0.99999961)
  9:        1        9 0.00000039 (1.00000000)
Tot:  2543846  3819498           (1.50147)
Cnt  2543846 Empty   937417 (0.36850) Collisions 247 RedDragon 7638996/7631537=1.000977
__________________
  • the fraction of empty slots is 0.36850, which is about what it should be (1/e ≈ 0.368)
  • the fraction of slots with exactly one item is also about 1/e
  • the fraction of slots with more than one item (chain length > 1) is what is left: 1 - 2/e ≈ 0.264 (see the quick check below)
  • the number of collisions seems a bit high, but with 2.5 M items on a 32-bit hash value it is not exceptional.
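
A quick check of those 1/e figures (my own sketch, not from the original answer): with N items hashed uniformly into N buckets, the occupancy of a single bucket is approximately Poisson-distributed with mean 1, so:

from math import exp, factorial

def poisson(k, lam=1.0):
    # probability that a bucket receives exactly k items when occupancy ~ Poisson(lam)
    return exp(-lam) * lam ** k / factorial(k)

print("empty       :", poisson(0))                    # ~0.3679 (1/e)
print("exactly one :", poisson(1))                    # ~0.3679 (1/e)
print("two or more :", 1 - poisson(0) - poisson(1))   # ~0.2642 (1 - 2/e)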


Answer 2:

A useful benchmark is repeatable randomness, by which I mean the collision proneness you would get if each distinct key were assigned to a randomly selected bucket.

For example, if you have 500 values in a 1000-bucket table, then achieving this benchmark means the expected number of collisions when inserting a new value is 0.5 (regardless of how the new value compares to the existing elements).

If you have some instrumentation in your hash table that tells you how many collisions happen when inserting a value, then you could, for example, insert elements until you hit a specific size():buckets ratio such as 0.5, then loop inserting one element (accumulating a collision counter) and deleting another until you have a good sample size. You can then compare the observed collision rate with 0.5 to get a feel for the quality of your hash function.

Alternatively, your instrumentation might let you insert a plethora of values, then iterate over them asking how many colliding elements had to be skipped over to get from the bucket each element hashes to, to the actual element. You could also measure the number of collisions during insertion of all the values, but then you have to consider the range of size():buckets ratios in play between resizings of the hash table.
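
A sketch of the first experiment (my own illustration, not the answerer's code; it assumes separate chaining and counts each element already present in the target bucket as one collision, with randrange standing in for the hash):

from random import randrange

def average_collisions(num_buckets=1000, load=0.5, trials=100000):
    # Fill the table to the target size():buckets ratio.
    counts = [0] * num_buckets
    occupants = []
    while len(occupants) < int(num_buckets * load):
        b = randrange(num_buckets)   # stand-in for hash(key) % num_buckets
        counts[b] += 1
        occupants.append(b)
    # Repeatedly insert one value (counting the entries already in its bucket)
    # and delete another, so the ratio stays at `load`.
    collisions = 0
    for _ in range(trials):
        b = randrange(num_buckets)
        collisions += counts[b]
        counts[b] += 1
        occupants.append(b)
        victim = occupants.pop(randrange(len(occupants)))
        counts[victim] -= 1
    return collisions / trials

print(average_collisions())   # ~0.5 when keys behave like random assignments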



Answer 3:

Based on @wildplasser's answer, here is the method from the Red Dragon Book as a Python 3 test function, with a small harness to verify that it behaves as expected:

from random import random
def hash_in_range(bucket_tot):
    # Simply return a random number to simulate a fairly even hash function:
    # using                   `random()`       -> `~1.0` score
    # replace `random()` with `random() * 0.5` -> `~2.0` score
    # replace `random()` with `random() * 0.1` -> `~10.0` score

    return int(random() * bucket_tot)

def count_quality(bucket_items, items_count):
    # Red Dragon metric: ~1.0 means the distribution is about as good as a
    # random assignment of keys to buckets; larger values mean more clustering.
    total = 0
    for count in bucket_items:
        total += count * (count + 1)
    return (total * len(bucket_items) /
            (items_count * (items_count + 2 * len(bucket_items) - 1)))


def test(bucket_tot, items_count):
    bucket_items = [0] * bucket_tot

    for i in range(items_count):
        bucket = hash_in_range(bucket_tot)
        bucket_items[bucket] += 1

    print("Testing:", bucket_tot, items_count, end="\t")
    score = count_quality(bucket_items, items_count)
    print(score)


test(100,       10000)
test(1000,      100000)
test(10000,     1000000)
test(1000,      10000)
test(1000,      100000)
test(100,       1000000)
test(1000,      1000000)
test(500,       100000)
test(25,        100000)

And an example C function, where the buckets are an array of singly-linked-list pointers.

/* Red Dragon quality metric over a chained hash table:
 * a result close to 1.0 means the keys are spread about as well as random. */
double hash_calc_quality(struct Hash *hash)
{
    uint64_t sum = 0;
    unsigned int i;

    if (hash->nentries == 0)
        return -1.0;

    for (i = 0; i < hash->nbuckets; i++) {
        uint64_t count = 0;
        Entry *e;
        for (e = hash->buckets[i]; e; e = e->next) {
            count += 1;
        }
        sum += count * (count + 1);
    }
    return ((double)sum * (double)hash->nbuckets /
            ((double)hash->nentries * (hash->nentries + 2 * hash->nbuckets - 1)));
}
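
As with the Python version, a result close to 1.0 (compare the RedDragon ... = 1.000977 figure in the sample run above) indicates the keys are spread roughly as well as a random assignment would spread them; noticeably larger values indicate clustering for that input.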