In Java, the hash code for a String
object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int
arithmetic, where s[i]
is the i
th character of the string, n
is the length of the string, and ^
indicates exponentiation.
Why is 31 used as a multiplier?
I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?
You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table.
P(31)
was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to chose one and so he tookP(31)
since it seemed to perform well enough. Even thoughP(33)
was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:So the reasoning was not as rational as many of the answers here seem to imply. But we're all good in coming up with rational reasons after gut decisions (and even Bloch might be prone to that).
Actually, 37 would work pretty well! z := 37 * x can be computed as
y := x + 8 * x; z := x + 4 * y
. Both steps correspond to one LEA x86 instructions, so this is extremely fast.In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting
y := x + 8 * x; z := x + 8 * y
.Using 73 or 37 (instead of 31) might be better, because it leads to denser code: The two LEA instructions only take 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy bridge architecture, with an increased latency of 3 cycles.
Moreover, 73 is Sheldon Cooper's favorite number.
Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.
The numbers that make up using a hash are typically:
You really only get to control a couple of these values, so a little extra care is due.
I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.
Neil Coffey explains why 31 is used under Ironing out the bias.
Basically using 31 gives you a more even set-bit probability distribution for the hash function.
As Goodrich and Tamassia point out, If you take over 50,000 English words (formed as the union of the word lists provided in two variants of Unix), using the constants 31, 33, 37, 39, and 41 will produce less than 7 collisions in each case. Knowing this, it should come as no surprise that many Java implementations choose one of these constants.
Coincidentally, I was in the middle of reading the section "polynomial hash codes" when I saw this question.
EDIT: here is link to the ~10mb PDF book i'm referring to above. See section 10.2 Hash Tables (page 413) of Data Structures and Algorithms in Java