I'm looking into an associative array library in C (which I didn't write), similar to a map in C++ or a dict in Python.
There are some non-standard hashing functions, and I'm not certain they are any good (maybe the original developer just threw in some magic numbers and xor operators and hoped for the best).
I wrote a test which measures how well a hashing function performs for some sample input: how evenly it distributes items into a fixed number of buckets (the hash modulo the array size, in this case).
This way, given enough input, there is at least some way to measure how well a hash function performs.
It seems like it must be a common issue for anyone writing an associative array.
Is there some convention for measuring how well a hash function performs? (in terms of distribution quality, not speed).
Where worst would be the same result for every input, and best would give an even distribution (or as close as possible).
Note, I'm not looking for cryptographic strength here.
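Roughly, the kind of test I mean just buckets a set of sample keys by hash(key) % nbuckets and looks at the per-bucket counts. A minimal sketch (with my_hash standing in for whichever of the library's hash functions is under test):

    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder hash (djb2-style), standing in for the function under test. */
    unsigned int my_hash(const char *key)
    {
        unsigned int h = 5381;
        while (*key) {
            h = h * 33 + (unsigned char)*key++;
        }
        return h;
    }

    /* Count how many of the sample keys fall into each of 'nbuckets' buckets. */
    void bucket_histogram(const char **keys, size_t nkeys, size_t nbuckets)
    {
        size_t *counts = calloc(nbuckets, sizeof(*counts));
        size_t i;
        if (!counts)
            return;
        for (i = 0; i < nkeys; i++) {
            counts[my_hash(keys[i]) % nbuckets] += 1;
        }
        for (i = 0; i < nbuckets; i++) {
            printf("bucket %zu: %zu\n", i, counts[i]);
        }
        free(counts);
    }

The open question is how to turn those per-bucket counts into a single number that says whether the distribution is good.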
There is a formula for this in the (Red) Dragon Book.
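As far as I recall it, with bj the number of items in bucket j, n the total number of items and m the number of buckets, it reads:

    SUM_j ( bj * (bj + 1) / 2 )  /  ( (n / 2m) * (n + 2m - 1) )

A hash that behaves like uniform random assignment scores about 1.0; larger values mean more clustering, and the worst case (every key in the same bucket) scores roughly m when n is much larger than m.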
I personally have a rule of thumb (assuming linear chaining): insert N items into N slots/chains, and compute the total number of accesses (first in chain := 1 access; 2nd := 2 accesses, etc.) needed to retrieve all N elements. This is equal to SUM( chainlen * (chainlen + 1) / 2 ), summed over all chains.
Given random input data, for any reasonable hash function this metric should be about 1.5 * N, or a bit below that.
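The 1.5 * N figure is simply what uniform random hashing predicts: with N items thrown at N slots, the chain lengths L are approximately Poisson distributed with mean 1, so E[L] = 1 and E[L^2] = 2, which gives

    E[ SUM( L * (L + 1) / 2 ) ] = N * (E[L^2] + E[L]) / 2 = 1.5 * N

(exactly (3N - 1) / 2 if you use the binomial distribution instead of the Poisson approximation, hence "a bit below").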
Example of a typical run using a list of 2543846 unique tokens/words (and their statistics)
hashed into exactly 2543846 slots/buckets:
plasser@pisbak:~/src/hash$ ./diskhash woorden.txt woorden.hsh
Ptr = 0x7fb5c264f000, Sz = 37362821
Array= 0x7fb5bff7e000 Cnt = 2543846
__________________
Histogram of seek lenghts:
len: Count Hops Fraction (Cumulative)
1: 1606429 1606429 0.63149617 (0.63149617)
2: 672469 1344938 0.26435130 (0.89584747)
3: 205046 615138 0.08060472 (0.97645219)
4: 48604 194416 0.01910650 (0.99555869)
5: 9477 47385 0.00372546 (0.99928415)
6: 1581 9486 0.00062150 (0.99990565)
7: 215 1505 0.00008452 (0.99999017)
8: 24 192 0.00000943 (0.99999961)
9: 1 9 0.00000039 (1.00000000)
Tot: 2543846 3819498 (1.50147)
Cnt 2543846 Empty 937417 (0.36850) Collisions 247 RedDragon 7638996/7631537=1.000977
__________________
- the fraction of empty slots is 0.36850, which is about what it should be (1/e ≈ 0.368)
- the fraction of slots with exactly one item (chain length = 1) is also about 1/e
- the fraction of slots with more than one item (chain length > 1) is what is left :: 1 - 2/e ≈ 0.264
- the number of collisions seems a bit high, but with 2.5 M items hashed to a 32-bit hash value it is not exceptional.
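(Those fractions are exactly what a Poisson model of the chain lengths predicts: at load factor 1 the probability of a chain holding k items is e^-1 / k!, so P(0) = P(1) = 1/e ≈ 0.368 and P(2 or more) = 1 - 2/e ≈ 0.264, which lines up with the measured 0.36850 empty slots.)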
A useful benchmark is repeatable randomness - by which I mean the collision proneness you would get if each distinct key were assigned to a randomly selected bucket.
For example, if you have 500 values in a 1000 bucket table, then achieving this benchmark means your chance of a collision when inserting a new value is 0.5 (regardless of how it compares to existing elements).
If you have some instrumentation in your hash table so you know how many collisions happen when inserting a value, then you could - for example - insert elements until you hit a specific size():buckets ratio like 0.5, then loop inserting one element (and accumulating a collision counter) and deleting another until you have a good sample size. You can then contrast the collision rate with 0.5 to get a feel for the quality of your hash function.
Alternatively, your instrumentation might let you insert a plethora of values, then iterate over them asking how many colliding elements had to be skipped over to get from the bucket each element hashes to to the actual element. You could also measure the number of collisions during insertion of all the values, but you then have to consider the range of size():buckets ratios in play between resizings of the hash table.
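A rough sketch of what that kind of instrumentation could look like, assuming a separate-chaining table (the struct and function names here are made up for illustration, not taken from any particular library): on every insert it counts how many elements already occupy the target bucket, which under purely random bucket assignment averages size/buckets per insert.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy separate-chaining table, instrumented to count how many existing
     * elements sit in the bucket each new element lands in. */
    struct Node { unsigned int hash; struct Node *next; };

    struct Table {
        struct Node **buckets;
        size_t nbuckets;
        size_t size;
        unsigned long long collisions;  /* existing elements seen on insert */
        unsigned long long inserts;
    };

    struct Table *table_create(size_t nbuckets)
    {
        struct Table *t = calloc(1, sizeof(*t));
        if (!t)
            return NULL;
        t->buckets = calloc(nbuckets, sizeof(*t->buckets));
        t->nbuckets = nbuckets;
        return t;
    }

    void table_insert(struct Table *t, unsigned int hash)
    {
        size_t b = hash % t->nbuckets;
        struct Node *e, *n;
        for (e = t->buckets[b]; e; e = e->next)
            t->collisions += 1;          /* elements already in this bucket */
        t->inserts += 1;
        n = malloc(sizeof(*n));
        if (!n)
            return;
        n->hash = hash;
        n->next = t->buckets[b];
        t->buckets[b] = n;
        t->size += 1;
    }

    /* Average number of colliding elements encountered per insert. */
    double collision_rate(const struct Table *t)
    {
        return t->inserts ? (double)t->collisions / (double)t->inserts : 0.0;
    }

    int main(void)
    {
        struct Table *t = table_create(1000);
        int i;
        if (!t)
            return 1;
        /* rand() stands in for hash(key) on your sample keys. Filling from
         * empty up to size:buckets = 0.5 means the load grows from 0 to 0.5,
         * so random assignment predicts about 0.25 colliding elements per
         * insert on average. */
        for (i = 0; i < 500; i++)
            table_insert(t, (unsigned int)rand());
        printf("avg colliding elements per insert: %.3f (random-assignment benchmark ~0.25)\n",
               collision_rate(t));
        return 0;
    }

In real use you would feed your hash function's output for the sample keys instead of rand(), and (as described above) hold the size:buckets ratio fixed with an insert/delete loop before sampling.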
Based on @wildplasser's answer, here is the method from the Red Dragon Book as a test function in Python 3.x, together with some calls to verify it behaves as expected:
from random import random


def hash_in_range(bucket_tot):
    # Simply return a random bucket index to simulate a fairly even hash function:
    # using `random()`                -> ~1.0 score
    # using `random() * 0.5` instead  -> ~2.0 score
    # using `random() * 0.1` instead  -> ~10.0 score
    return int(random() * bucket_tot)


def count_quality(bucket_items, items_count):
    total = 0
    for count in bucket_items:
        total += count * (count + 1)
    return (total * len(bucket_items) /
            (items_count * (items_count + 2 * len(bucket_items) - 1)))


def test(bucket_tot, items_count):
    bucket_items = [0] * bucket_tot
    for _ in range(items_count):
        bucket = hash_in_range(bucket_tot)
        bucket_items[bucket] += 1
    print("Testing:", bucket_tot, items_count, end="\t")
    score = count_quality(bucket_items, items_count)
    print(score)


test(100, 10000)
test(1000, 100000)
test(10000, 1000000)
test(1000, 10000)
test(1000, 100000)
test(100, 1000000)
test(1000, 1000000)
test(500, 100000)
test(25, 100000)
And an example C function, where the buckets are an array of singly linked list pointers:
double hash_calc_quality(struct Hash *hash)
{
    uint64_t sum = 0;
    unsigned int i;

    if (hash->nentries == 0)
        return -1.0;

    /* Accumulate chainlen * (chainlen + 1) over all buckets. */
    for (i = 0; i < hash->nbuckets; i++) {
        uint64_t count = 0;
        Entry *e;
        for (e = hash->buckets[i]; e; e = e->next) {
            count += 1;
        }
        sum += count * (count + 1);
    }

    /* Red Dragon Book quality metric: ~1.0 means a near-uniform distribution. */
    return ((double)sum * (double)hash->nbuckets /
            ((double)hash->nentries * (hash->nentries + 2 * hash->nbuckets - 1)));
}
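For reference, the function above only assumes a layout along these lines (the field names are taken from the snippet; the real definitions are whatever your hash library uses):

    #include <stdint.h>

    typedef struct Entry {
        struct Entry *next;      /* next entry in this bucket's chain */
        /* key/value members omitted */
    } Entry;

    struct Hash {
        Entry **buckets;         /* array of singly linked list heads */
        unsigned int nbuckets;   /* number of buckets in the array */
        unsigned int nentries;   /* total number of stored entries */
    };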