I'm looking for a hash function that:
- Hashes textual strings well (e.g. few collisions)
- Is written in Java, and widely used
- Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
- Bonus: Has a 128-bit variant.
- Bonus: Not CPU intensive.
Create a SHA-1 hash and then keep only the lowest 64 bits.
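A minimal sketch of that suggestion, using only the JDK's `MessageDigest`. The class name `Sha1Hash64`, the choice of UTF-8, and the reading of "lowest 64 bits" as the last 8 bytes of the 20-byte digest are assumptions for illustration, not part of the answer above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: truncates a SHA-1 digest to 64 bits.
final class Sha1Hash64 {
    /** Returns the last 8 bytes of the SHA-1 digest of the UTF-8 bytes of text, as a long. */
    static long hash64(String text) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            // SHA-1 always produces a 20-byte (160-bit) digest.
            byte[] digest = sha1.digest(text.getBytes(StandardCharsets.UTF_8));
            // Read bytes 12..19 as one big-endian long.
            return ByteBuffer.wrap(digest, digest.length - 8, 8).getLong();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship SHA-1, so this should not happen.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

To hash several fields without concatenating them, you could call `MessageDigest.update(...)` once per field before `digest()`. Note that a cryptographic digest costs more CPU than a simple multiplicative hash.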
Have you looked at Apache Commons Lang?
But for 64-bit (and 128-bit) hashes you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch make it easy to create a 64-bit hash (just use long instead of int). For 128 bits you need additional hacks...
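As a rough sketch of the "use long instead of int" idea: the usual seed-and-multiply recipe, carried out in 64-bit arithmetic. The class name `LongHash`, both method names, and the constants 17 and 31 are assumptions chosen for illustration.

```java
// Hypothetical helper: the 17/31 seed-and-multiply recipe widened to long.
final class LongHash {
    /** 64-bit analogue of String.hashCode(): h = 31 * h + c for each char. */
    static long hashString(String s) {
        if (s == null) {
            return 0L;
        }
        long h = 0L;
        for (int i = 0; i < s.length(); i++) {
            h = 31L * h + s.charAt(i);
        }
        return h;
    }

    /** Combines several fields into one 64-bit hash without concatenating them. */
    static long hashFields(String... fields) {
        long result = 17L; // arbitrary non-zero seed
        for (String field : fields) {
            result = 31L * result + hashString(field);
        }
        return result;
    }
}
```

For example, `LongHash.hashFields("ab", "c")` and `LongHash.hashFields("a", "bc")` give different values, which plain concatenation would not. With a multiplier as small as 31, short strings leave the high bits unused, so a larger odd multiplier may spread values better.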
DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.
I'm not aware of a function but here's an idea that might help:
You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.
Assuming your input is text-only, I imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike the other solutions so far, this approach takes the problem domain into account to reduce collisions. It is based on the Anagram Detector described in Programming Pearls.
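One possible reading of this idea, sketched under loud assumptions: the low 52 bits record which of the letters a-z and A-Z occur in the word (one bit per letter), and the remaining 12 bits hold the string length modulo 4096. The 52-bit layout, the class name `LetterMaskHash`, and the bit positions are guesses for illustration; only the "remaining 12 bits" part comes from the text above.

```java
// Hypothetical helper: 52 letter-presence bits plus 12 length bits.
final class LetterMaskHash {
    static long hash64(String word) {
        long mask = 0L;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= 'a' && c <= 'z') {
                mask |= 1L << (c - 'a');          // bits 0..25: lowercase letters
            } else if (c >= 'A' && c <= 'Z') {
                mask |= 1L << (26 + (c - 'A'));   // bits 26..51: uppercase letters
            }
            // Non-alphabetic characters only affect the length bits.
        }
        long lengthBits = word.length() & 0xFFF;  // length modulo 4096, 12 bits
        return (lengthBits << 52) | mask;         // bits 52..63: length
    }
}
```

Under this layout, anagrams such as "ab" and "ba" always collide, as do words of equal length that use the same set of letters, so it only suits inputs where such pairs are rare.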