hashing in Java — structure & access time

Posted 2019-02-24 05:50

Question:

I am looking for verification on two different but related arguments-- the ones above (A) and below (B) the first full-line comment in this question.

(A) The way HashMap is structured is:

A HashMap is a plain table, and that's direct memory access (DMA).

The whole idea behind HashMap (or hashing in general) in the first place is to put this constant-time memory access to use for

a.) accessing records by their own data content (< K,V >), not by their locations in DMA (the table index)

b.) managing a variable number of records-- a number of records that is not fixed in advance and may or may not stay constant throughout the use of this structure.

So, the overall structure in a Java Hash is:

a table: table // I'm using the identifier used in HashMap

each cell of this table is a bucket.

Each bucket is a linked list of type Entry-- i.e., each node of this linked list (not the linked list of Java/API, but the data structure) is of type Entry which in turn is a < K,V > pair.
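For reference, such an entry in the pre-Java-8 OpenJDK HashMap looks roughly like the sketch below. This is paraphrased from memory and simplified (accessor methods left out), so treat it as an illustration of the structure rather than the exact source:

static class Entry<K, V> {
    final K key;
    V value;
    Entry<K, V> next;   // next entry in the same bucket's chain
    final int hash;     // cached (normalized) hash of the key

    Entry(int hash, K key, V value, Entry<K, V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}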

When a new pair comes in to be added to the hash, a hashCode is calculated for this < K,V > pair (from its key). This hashCode is the key to the index of this < K,V > in table-- it tells which bucket this < K,V > will go into. Note: the hashCode is "normalized" through the function hash() (in HashMap, for one) to better fit the current length of the table, and indexFor() is then used to determine which bucket, i.e., which cell of table, the < K,V > will go in.
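For concreteness, the two helpers mentioned here look roughly like this in the Java 6/7 OpenJDK HashMap (paraphrased from memory; the exact bit-twiddling in hash() may differ slightly between versions):

// Supplemental hash: spreads the influence of the high bits of hashCode()
// into the low bits, because indexFor() below only looks at the low bits.
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

// Bucket selection: the table length is always a power of two, so this bit
// mask is equivalent to (h % length) but cheaper, and it only keeps low bits.
static int indexFor(int h, int length) {
    return h & (length - 1);
}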

When the bucket is determined, the < K,V > is added to the beginning of the linked list in this bucket-- as a result, it is the first < K,V > entry in this bucket, and the first entry of the linked-list-that-already-existed is now the "next" entry pointed to by this newly added one.
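The head insertion described here corresponds roughly to the following method in the pre-Java-8 OpenJDK HashMap (again a paraphrase, not the verbatim source):

// A new entry is linked in at the head of the chosen bucket's chain.
// Note: the resize decision is based on the overall size vs. threshold,
// not on how long this particular bucket's chain has become.
void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K, V> oldHead = table[bucketIndex];                     // may be null
    table[bucketIndex] = new Entry<>(hash, key, value, oldHead);  // new entry points to it as "next"
    if (size++ >= threshold)
        resize(2 * table.length);
}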

//===============================================================

(B) From what I see in HashMap, resizing of the table-- i.e., of the hash-- is only done based on a decision about the hash's size and capacity, which are the current and maximum number of entries in the entire hash.

There is no restructuring or resizing based on individual bucket sizes-- nothing like "resize() when the max. # of entries in a bucket exceeds such-and-such".
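For reference, the relevant pre-Java-8 OpenJDK code looks roughly like this (paraphrased, not the verbatim source): the only trigger is the overall entry count crossing threshold (capacity * load factor), as in the addEntry() sketch above, and resize() then rehashes every entry into a larger table.

// Doubles the table and re-distributes every entry. It is called only when
// the total number of entries reaches threshold, never because one bucket
// happens to have a long chain.
void resize(int newCapacity) {
    Entry[] newTable = new Entry[newCapacity];
    transfer(newTable);
    table = newTable;
    threshold = (int) (newCapacity * loadFactor);
}

void transfer(Entry[] newTable) {
    for (int j = 0; j < table.length; j++) {            // walk every old bucket
        Entry<K, V> e = table[j];
        while (e != null) {
            Entry<K, V> next = e.next;
            int i = indexFor(e.hash, newTable.length);  // new bucket for this entry
            e.next = newTable[i];                       // head-insert into the new bucket
            newTable[i] = e;
            e = next;
        }
    }
}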

It is not probable, but it is possible, that a significant number of entries pile up in one bucket while the rest of the hash is pretty much empty.

If this is the case, i.e., there is no upper limit on the size of each bucket, then the hash has not constant but linear access time-- theoretically, at least. It takes O(n) time to get hold of an entry in the hash, where n is the total number of entries. But then it shouldn't behave like that.
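To make this degenerate case concrete, here is a small hypothetical demo (the class names are made up for illustration): a key type whose hashCode() is constant sends every entry to the same bucket, so on a pre-Java-8 HashMap every lookup walks the whole chain.

import java.util.HashMap;
import java.util.Map;

public class OneBucketDemo {

    // Hypothetical key whose hashCode() is constant: all instances collide,
    // so the whole map effectively degenerates into one bucket's linked list.
    static final class BadKey {
        private final int id;
        BadKey(int id) { this.id = id; }

        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int hashCode() { return 42; }   // every key -> the same bucket
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++)
            map.put(new BadKey(i), "value-" + i);

        // The lookup is still correct, but on a pre-Java-8 HashMap it has to
        // scan the single collision chain: O(n) instead of O(1).
        System.out.println(map.get(new BadKey(9_999)));
    }
}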

//===============================================================

I don't think I'm missing anything in Part (A) above.

I'm not entirely sure of Part (B). It is a significant issue and I'm looking to find out how accurate this argument is.

I'm looking for verification on both parts.

Thanks in advance.

//===============================================================

EDIT:

Fixing a maximum bucket size, i.e., restructuring the hash whenever the # of entries in a bucket hits a maximum, would resolve it-- the access time is then plainly constant in theory and in use.

This isn't a well-structured fix, but a quick one, and it would work just fine for the sake of constant access (see the sketch just below).
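A minimal sketch of that quick fix, assuming a hypothetical MAX_BUCKET_SIZE constant and the addEntry()/resize() helpers sketched above (this is not what HashMap actually does):

// Hypothetical variant: resize as soon as any single chain grows past
// MAX_BUCKET_SIZE, in addition to the usual overall-size threshold.
static final int MAX_BUCKET_SIZE = 8;   // assumed limit, not a real HashMap field

void addEntry(int hash, K key, V value, int bucketIndex) {
    table[bucketIndex] = new Entry<>(hash, key, value, table[bucketIndex]);
    size++;
    if (size >= threshold || chainLength(bucketIndex) > MAX_BUCKET_SIZE)
        resize(2 * table.length);
}

int chainLength(int bucketIndex) {
    int n = 0;
    for (Entry<K, V> e = table[bucketIndex]; e != null; e = e.next)
        n++;
    return n;
}

One caveat: if the long chain is caused by keys whose hash values are actually identical (not merely mapped to the same index), doubling the table does not split that bucket, so such a check would have to avoid triggering resize after resize to no effect.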

The hashCodes are likely to be evenly distributed throughout the buckets, and it isn't so likely that any one of the buckets will hit the bucket-max before the threshold on the overall size of the hash is hit. This is the assumption the current setup of HashMap is using as well.

Also based on Peter Lawrey's discussion below.

Answer 1:

Collisions in HashMap are only a problem in pathological cases such as denial of service attacks.

In Java 7, you can change the hashing strategy such that an external party cannot predict your hashing algo.

AFAIK, in Java 8 a HashMap with String keys will use a tree instead of a linked list for buckets with many collisions. This means O(log N) worst-case access times instead of O(n).
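A small demo of such a pathological case (the class name is made up, but the collision trick is real: "Aa" and "BB" have the same String.hashCode(), so any same-length concatenation of those two blocks collides as well):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollidingStringsDemo {

    // Builds 2^blocks distinct strings that all share one hashCode(), by
    // concatenating "Aa"/"BB" blocks ("Aa".hashCode() == "BB".hashCode() == 2112).
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String s : result) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = collidingStrings(12);   // 4096 keys, identical hashCode()
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++)
            map.put(keys.get(i), i);

        // Every key maps to the same bucket. On Java 7 each get() walks a linked
        // list (O(n)); on Java 8+ the bucket is converted to a tree, and since
        // String is Comparable the lookup is O(log n).
        System.out.println(map.get(keys.get(keys.size() - 1)));   // prints 4095
    }
}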



Answer 2:

"I'm looking to increase the table size when everything is in the same hash. The hash-to-bucket mapping changes when the size of the table does."

Your idea sounds good. And it is completely true and basically what HashMap does when the table size is smaller than desired, i.e., when the average number of elements per bucket gets too large. It doesn't do that by looking at each bucket and checking if there is too much in there; it just tracks the overall number of entries, which is easy to calculate.

The implementation of HashMap.get() in OpenJDK, according to this, is:

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

That shows pretty well how HashMap finds elements, but it's written in a very confusing way. After a bit of renaming, commenting and rewriting it could look roughly like this:

public V get(Object key) {
    if (key == null)
        return getForNullKey();

    // get the key's hash & try to improve the distribution.
    // -> this may turn every 42 that goes in into a 9,
    //    but it can't turn 42 into a 9 one time and an 8 another time.
    int hash = hash(key.hashCode());

    // calculate bucket index, same hash must result in same index as well
    // since table length is fixed at this point.
    int bucketIndex = indexFor(hash, table.length);
    // we have just found the right bucket. O(1) so far.
    // and this is the whole point of hash based lookup:
    // instantly knowing the nearly exact position where to find the element.


    // next, see if the key is found in the bucket -> get the list stored in the bucket
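    // (in the real HashMap, table[bucketIndex] is the head Entry of a hand-rolled
    //  linked list, not a java.util.LinkedList -- the LinkedList type here is
    //  only for illustration)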
    LinkedList<Entry> bucketContentList = table[bucketIndex];

    // check each element, in worst case O(n) time if everything is in this bucket.
    for (Entry entry : bucketContentList) {
        if (entry.key.equals(key))
            return entry.value;
    }
    return null;
}

What we see here is that the bucket index indeed depends on both the .hashCode() returned by each key object and the current table size. It will usually differ from key to key, but only in cases where the keys' .hashCode() values differ.

If you had an enormous table with 2^32 buckets, you could simply say bucketIndex = key.hashCode() and it would be as perfect as it can get. Unfortunately there is not enough memory to do that, so you have to use fewer buckets and map 2^32 possible hashes into just a few buckets. That's what indexFor essentially does: mapping a large number space into a small one.

That is perfectly fine in the typical case where (almost) no object has the same .hashCode() as any other. But the one thing you must not do with HashMaps is add only elements with exactly the same hash.

If every hash is the same, your hash-based lookup always results in the same bucket, and all your HashMap has become is a LinkedList (or whatever data structure holds the elements of a bucket). And now you have the worst-case scenario of O(N) access time, because you have to iterate over all N elements.