How should I map long to int in hashCode()?

Posted 2019-02-02 02:48

Question:

I have a range of objects that have a long field whose value uniquely identifies a particular object across my entire system, much like a GUID. I have overridden Object.equals() to use this id for comparison, because I want it to work with copies of the object. Now I want to override Object.hashCode(), too, which basically means mapping my long to some int return value.

If I understood the purpose of hashCode correctly, it is mainly used in hash tables, so a uniform distribution would be desirable. That would mean simply returning id % 2^32 (the id modulo 2 to the 32nd power) would suffice. Is that all, or should I be aware of something else?

Answer 1:

Since Java 8 you can use

Long.hashCode(guid);

For older versions of Java you can use the following:

Long.valueOf(guid).hashCode();

Note that this solution may allocate a new Long object on the heap (Long.valueOf only guarantees caching for values from -128 to 127), while the first doesn't (although it is likely that the JIT optimizes the object creation away).

Looking at the docs, both ways just use the following algorithm:

(int)(this.longValue()^(this.longValue()>>>32))

These are decent solutions since they make use of the Java library - always better to leverage off of something that has been tested already.
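To make this concrete, here is a minimal sketch of an id-based class that pairs equals() with Long.hashCode(long). The class name Entity and its field are hypothetical, not taken from the question, and it assumes Java 8 or later.

```java
// Hypothetical entity whose identity is a single long id (names are illustrative).
final class Entity {
    private final long id;

    Entity(long id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Entity)) {
            return false;
        }
        return id == ((Entity) other).id;
    }

    @Override
    public int hashCode() {
        // Java 8+: folds the upper 32 bits into the lower 32 without boxing.
        return Long.hashCode(id);
    }

    public static void main(String[] args) {
        Entity original = new Entity(0x1234_5678_9ABC_DEF0L);
        Entity copy = new Entity(0x1234_5678_9ABC_DEF0L);
        // Equal objects must report equal hash codes.
        System.out.println(original.equals(copy));
        System.out.println(original.hashCode() == copy.hashCode());
    }
}
```

Because equals() and hashCode() both depend only on id, a copy of an object lands in the same hash bucket as the original.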



Answer 2:

It's a bit of a minor thing if you're not using Guava already, but Guava can do this for you nicely:

public int hashCode() {
  return Longs.hashCode(id);
}

That gives you the equivalent of Long.valueOf(id).hashCode():

return (int) (value ^ (value >>> 32));

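Additionally, if you were to have other values or objects that were part of the hashcode, you could just write the following with Guava's Objects class (com.google.common.base.Objects, not java.util.Objects, whose hashCode takes a single argument):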

return Objects.hashCode(longValue, somethingElse, ...);

The long would be autoboxed into a Long so you'd get the correct hashcode for it as part of the overall hashcode.
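If Guava is not on the classpath, the JDK's java.util.Objects.hash(Object...) (Java 7+) can combine several values in a similar way. The sketch below swaps in that JDK method in place of Guava; the class name Account and its fields are hypothetical.

```java
import java.util.Objects;

// Hypothetical class combining a long id with another field (names are illustrative).
final class Account {
    private final long id;
    private final String region;

    Account(long id, String region) {
        this.id = id;
        this.region = region;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Account)) {
            return false;
        }
        Account that = (Account) other;
        return id == that.id && Objects.equals(region, that.region);
    }

    @Override
    public int hashCode() {
        // The long is autoboxed to a Long, so its hash code feeds into the result.
        return Objects.hash(id, region);
    }
}
```

Note that the combined value is not the same number as Long.hashCode(id) on its own, which only matters if something else depends on the exact hash value.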



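Answer 3:

You have understood the purpose of hashCode correctly. Yes, a uniform distribution is desirable (although not an actual requirement).

I would suggest (int) ((id >> 32) ^ id). The cast is needed because the XOR of two longs is still a long, while hashCode() must return an int.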

The above expression:

  • Uses all bits of the original value and does not discard any information upfront. For example, depending on how you are generating the IDs, the upper bits could change more frequently (or the opposite).
  • Does not introduce any bias towards values with more ones (zeros), as would be the case if the two halves were combined with an OR (AND) operation.
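The bias point can be checked empirically. The sketch below (class and method names are my own, and the seed is arbitrary) folds random longs with XOR, OR and AND and averages the number of set bits in the 32-bit result. For uniformly random input the averages should come out near 16, 24 and 8 respectively.

```java
import java.util.Random;

// Compares how many bits end up set when the two 32-bit halves of a long
// are combined with XOR, OR and AND. Class and method names are illustrative.
final class FoldBias {
    static int foldXor(long v) { return (int) (v ^ (v >>> 32)); }
    static int foldOr(long v)  { return (int) (v | (v >>> 32)); }
    static int foldAnd(long v) { return (int) (v & (v >>> 32)); }

    // Returns the average number of set bits as {xor, or, and} over random inputs.
    static double[] averageBitCounts(int samples, long seed) {
        Random random = new Random(seed);
        long xorBits = 0;
        long orBits = 0;
        long andBits = 0;
        for (int i = 0; i < samples; i++) {
            long v = random.nextLong();
            xorBits += Integer.bitCount(foldXor(v));
            orBits += Integer.bitCount(foldOr(v));
            andBits += Integer.bitCount(foldAnd(v));
        }
        return new double[] {
            (double) xorBits / samples,
            (double) orBits / samples,
            (double) andBits / samples
        };
    }

    public static void main(String[] args) {
        double[] avg = averageBitCounts(100_000, 42L);
        System.out.printf("xor=%.2f or=%.2f and=%.2f%n", avg[0], avg[1], avg[2]);
    }
}
```

Only XOR keeps each output bit equally likely to be 0 or 1, which is what a hash table wants.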


Answer 4:

Java 8 adds Long.hashCode(long) to the JDK.

The following code could yield higher performance. This code reduces the calculation to 32-bit int instead of computing with 64-bit long. This can make a difference on 32-bit and smaller architectures. 32-bit processes on x86 machines could optimize this into a single instruction which simply XORs 2 registers.

return (int) value ^ (int) (value >>> 32);

As noted in other answers, this does not have a good avalanche effect and hence could lead to collisions. One could go with cryptographic hash functions to ensure a high avalanche effect. However, there are other algorithms, such as MurmurHash, which have a very good avalanche effect but don't consume as much CPU time.
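As a rough illustration of what a mixing step looks like, here is a sketch based on the 64-bit finalizer (usually called fmix64) from MurmurHash3, followed by a truncation to int. The class and method names are my own; this is only the finalizer, not the full MurmurHash algorithm.

```java
// Sketch of the 64-bit finalizer ("fmix64") from MurmurHash3, followed by a
// truncation to int. Class and method names are illustrative.
final class MixedLongHash {
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    static int hash(long value) {
        // After mixing, a change in any input bit tends to flip about half
        // of the output bits, unlike the plain XOR fold.
        return (int) fmix64(value);
    }
}
```

With the plain XOR fold, flipping one input bit flips exactly one output bit. With the mixer, roughly half of the 32 output bits change on average. Whether that is worth the extra multiplications depends on how structured the ids are.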



Answer 5:

(l >> 32) ^ l is a good hashcode in most cases; particularly when the long has a uniform distribution.

Since it was the accepted answer, I'm posting this to clarify some of my comments about when it's NOT a good hashcode for a long.

The example I gave was a Point class like this:

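public class Point {
    private final long coords; // x in the high 32 bits, y in the low 32 bits
    public Point(int x, int y) {
        // Mask y so that a negative value does not sign-extend into the x half.
        coords = ((long) x << 32) | (y & 0xFFFFFFFFL);
    }
    public int getX() {
        return (int)(coords >> 32);
    }
    public int getY() {
        return (int)coords;
    }
    @Override
    public int hashCode() {
        return (int)((coords >> 32) ^ (coords));
    }
}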

It may seem contrived, but occasionally you have multiple "fields" packed into a long.

So the coords field represents 32 bits of x and 32 bits of y. Why is this a problem? Well, it's not if each of x and y is evenly distributed over its respective 32 bits. But that's unlikely in practice. What is more likely is that X and Y are bounded by some number. Let's say 1024, since it's 2^10, so each value lies between 0 and 1023. This means that at most the lower 10 bits of each of X and Y are set:

00000000 00000000 000000XX XXXXXXXX 00000000 00000000 000000YY YYYYYYYY

There are 2^20 (1024*1024) possible combinations. But what's the operation hashCode is doing?

  00000000 00000000 000000XX XXXXXXXX 
^ 00000000 00000000 000000YY YYYYYYYY
-------------------------------------
= 00000000 00000000 000000?? ????????

There are at most 2^10 (1024) possible hashCode values since only the lower 10 bits can ever be anything other than zero. The ratio of hash values to real values is 1024:(1024*1024) or 1:1024. So right off the bat there is a 1/1024 probability that two numbers have the same hash.
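The count is easy to confirm by brute force. The sketch below (class and method names are my own) packs every (x, y) pair with both coordinates below 1024 in the same layout as the Point class and counts how many distinct values the XOR fold produces.

```java
// Counts distinct hash codes for the packed-coordinate layout described above,
// with x and y each bounded to [0, 1024). Names are illustrative.
final class PackedPointCollisions {
    static final int BOUND = 1024;

    static long pack(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    static int xorFold(long coords) {
        return (int) ((coords >> 32) ^ coords);
    }

    static int countDistinctHashes() {
        // With bounded coordinates every hash fits in the low 10 bits.
        boolean[] seen = new boolean[BOUND];
        int distinct = 0;
        for (int x = 0; x < BOUND; x++) {
            for (int y = 0; y < BOUND; y++) {
                int h = xorFold(pack(x, y));
                if (!seen[h]) {
                    seen[h] = true;
                    distinct++;
                }
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        // Prints "1024 distinct hashes for 1048576 points".
        System.out.println(countDistinctHashes() + " distinct hashes for "
                + (BOUND * BOUND) + " points");
    }
}
```

Note also that (3, 5) and (5, 3) hash to the same value, since XOR is symmetric in its two operands.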

Now let's calculate the probability of a collision by applying math from the birthday problem. Let p(n) be the probability that with n values there will be at least one collision. We know that p(1025+) = 1 since there are only 1024 values.

p(n) = 1 - (n! * (1024 choose n))/1024^n

This works out to the following:

n: p(n)
1: 0.00000
2: 0.00098
3: 0.00293
4: 0.00585
5: 0.00973
6: 0.01457
...
38: 0.50096
...
79: 0.95444
...
148: 0.99999
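For anyone who wants to reproduce the table, here is a small sketch that evaluates the same formula as a running product, which avoids computing huge factorials. The class and method names are my own, and it assumes hashes are spread uniformly over the 1024 possible values.

```java
// Probability of at least one collision among n items hashed uniformly
// into `buckets` values (the birthday problem). Names are illustrative.
final class BirthdayCollision {
    static double collisionProbability(int n, int buckets) {
        if (n > buckets) {
            return 1.0; // pigeonhole: more items than distinct hash values
        }
        double noCollision = 1.0;
        for (int k = 0; k < n; k++) {
            // The (k+1)-th item must avoid the k values already taken.
            noCollision *= (double) (buckets - k) / buckets;
        }
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 38, 79, 148}) {
            System.out.printf("%d: %.5f%n", n, collisionProbability(n, 1024));
        }
    }
}
```

The product (1024/1024) * (1023/1024) * ... * ((1024 - n + 1)/1024) is the same quantity as n! * (1024 choose n) / 1024^n, just written term by term.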

With just 38 items, there is probably a collision. With 148 items, there is a 99.999% chance of (at least one) collision. With 148 items, each item has roughly a 13% chance of colliding with another item (1 - (1023/1024)^147). With a proper hashing function, taking knowledge of the domain, these numbers could easily go down to 0.

In other words, knowing your domain and how things happen in practice are key to making a performant hash. Library functions try to do as good a job as possible knowing nothing about your domain, and to be performant typically rely on a distribution of data that won't occur in practice.
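As one possible domain-aware alternative, here is a sketch that assumes both coordinates are known to lie in [0, 1024). The class name BoundedPoint and the 10-bit bound are assumptions for illustration. It places the two 10-bit values side by side instead of XORing them on top of each other.

```java
// Domain-aware hash for coordinates known to lie in [0, 1024): the two
// 10-bit values are placed side by side. Names and bound are assumptions.
final class BoundedPoint {
    private static final int BITS = 10; // assumes 0 <= x, y < 1024
    private final long coords;          // x in the high 32 bits, y in the low 32 bits

    BoundedPoint(int x, int y) {
        this.coords = ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    int getX() {
        return (int) (coords >> 32);
    }

    int getY() {
        return (int) coords;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof BoundedPoint
                && coords == ((BoundedPoint) other).coords;
    }

    @Override
    public int hashCode() {
        // Distinct in-range (x, y) pairs map to distinct 20-bit values.
        return (getX() << BITS) | getY();
    }
}
```

Under that assumption all 2^20 points get distinct hash codes. If the coordinates can exceed the bound, the hash is still valid but collisions reappear.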



Answer 6:

int result = (int)((longVal >> 32) ^ longVal);

will be better distributed, because a modulo returns the same value when only the upper bits of your long value have changed.
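A small sketch of the difference, using two made-up ids that differ only in their upper 32 bits (class and method names are illustrative):

```java
// Two ids that differ only in their upper 32 bits. Names are illustrative.
final class UpperBitsDemo {
    static int truncate(long v) {
        return (int) v; // keeps only the low 32 bits, like v mod 2^32
    }

    static int fold(long v) {
        return (int) ((v >> 32) ^ v);
    }

    public static void main(String[] args) {
        long a = 0x0000_0001_0000_002AL;
        long b = 0x0000_0002_0000_002AL;
        System.out.println(truncate(a) == truncate(b)); // true
        System.out.println(fold(a) == fold(b));         // false
    }
}
```

Truncation maps both ids to 42, while the fold yields 0x2B and 0x28.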