I'm looking for a hash function that:
- Hashes textual strings well (e.g. few collisions)
- Is written in Java, and widely used
- Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
- Bonus: Has a 128-bit variant.
- Bonus: Not CPU intensive.
Do something like this:
DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStream in it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.
Finally BigInteger will let you turn the output bytes into an easier-to-use number. The MD5 and SHA1 algorithms both produce 128-bit hashes, so if you need 64 you can just truncate.
SHA1 should hash almost anything well, and with infrequent collisions (it's 128-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the
DataOutputStream
and you're good to go. You could even do it with reflection and annotations (maybe@HashComponent(order=1)
to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call
MessageDigest.getInstance()
once and thenclone()
from then on: IIRC the cloning is a lot faster.Reverse the string to get another 32-bit hashcode and then combine the two:
This is pseudocode; the
String.reverse()
method doesn't exist and will need to be implemented some other way.Why not use a CRC64 polynomial. These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.
There are plenty of implementations available on the net if you google "CRC64 Java"
Why don't you use a
long
variant of the defaultString.hashCode()
(where some really smart guys certainly put effort into making it efficient - not mentioning the thousands of developer eyes that already looked at this code)?If you're looking for even more bits, you could probably use aEdit:BigInteger
As I mentioned in a comment to the answer of @brianegge, there are not much usecases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:
To combine the hash for several fields, simply
do an XORmultiply one with a prime and add them:The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.
Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.
Update I think the above satisfies the simplest thing which could possibly work, however, I agree with @sfussenegger idea of extending the existing String hashCode.
In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distributed your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.
An answer for today (2018). SipHash.
It will be much faster than most of the answers here, and significantly higher quality than all of them.
The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--