Why does Java's hashCode() in String use 31 as

2018-12-31 04:33发布

In Java, the hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation.

Why is 31 used as a multiplier?

I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?

10条回答
看淡一切
2楼-- · 2018-12-31 05:07

You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to chose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:

Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

So the reasoning was not as rational as many of the answers here seem to imply. But we're all good in coming up with rational reasons after gut decisions (and even Bloch might be prone to that).

查看更多
梦寄多情
3楼-- · 2018-12-31 05:07

Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Both steps correspond to one LEA x86 instructions, so this is extremely fast.

In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.

Using 73 or 37 (instead of 31) might be better, because it leads to denser code: The two LEA instructions only take 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy bridge architecture, with an increased latency of 3 cycles.

Moreover, 73 is Sheldon Cooper's favorite number.

查看更多
墨雨无痕
4楼-- · 2018-12-31 05:08

Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.

The numbers that make up using a hash are typically:

  • modulus of the data type you put it into (2^32 or 2^64)
  • modulus of the bucket count in your hashtable (varies. In java used to be prime, now 2^n)
  • multiply or shift by a magic number in your mixing function
  • The input value

You really only get to control a couple of these values, so a little extra care is due.

查看更多
皆成旧梦
5楼-- · 2018-12-31 05:24

I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.

查看更多
刘海飞了
6楼-- · 2018-12-31 05:25

Neil Coffey explains why 31 is used under Ironing out the bias.

Basically using 31 gives you a more even set-bit probability distribution for the hash function.

查看更多
宁负流年不负卿
7楼-- · 2018-12-31 05:26

As Goodrich and Tamassia point out, If you take over 50,000 English words (formed as the union of the word lists provided in two variants of Unix), using the constants 31, 33, 37, 39, and 41 will produce less than 7 collisions in each case. Knowing this, it should come as no surprise that many Java implementations choose one of these constants.

Coincidentally, I was in the middle of reading the section "polynomial hash codes" when I saw this question.

EDIT: here is link to the ~10mb PDF book i'm referring to above. See section 10.2 Hash Tables (page 413) of Data Structures and Algorithms in Java

查看更多
登录 后发表回答