I was reading the javadocs on HashSet when I came across the interesting statement:
This class offers constant time performance for the basic operations (add, remove, contains and size)
This confuses me greatly, as I don't understand how one could possibly get constant time, O(1), performance for a comparison operation. Here are my thoughts:
If this is true, then no matter how much data I'm dumping into my HashSet, I will be able to access any element in constant time. That is, if I put 1 element in my HashSet, it will take the same amount of time to find it as if I had a googolplex of elements.
However, this wouldn't be possible if I had a constant number of buckets, or a consistent hash function, since for any fixed number of buckets, the number of elements in that bucket will grow linearly (albeit slowly, if the number is big enough) with the number of elements in the set.
Then, the only way for this to work is to have a changing hash function every time you insert an element (or every few times). A simple hash function that never any collisions would satisfy this need. One toy example for strings could be: Take the ASCII value of the strings and concatenate them together (because adding could result in a conflict).
However, this hash function, and any other hash function of this sort will likely fail for large enough strings or numbers etc. The number of buckets that you can form is immediately limited by the amount of stack/heap space you have, etc. Thus, skipping locations in memory can't be allowed indefinitely, so you'll eventually have to fill in the gaps.
But if at some point there's a recalculation of the hash function, this can only be as fast as finding a polynomial which passes through N points, or O(nlogn).
Thus arrives my confusion. While I will believe that the HashSet can access elements in O(n/B) time, where B is the number of buckets it has decided to use, I don't see how a HashSet could possibly perform add or get functions in O(1) time.
Note: This post and this post both don't address the concerns I listed..
The number of buckets is dynamic, and is approximately ~
2n
, wheren
is the number of elements in the set.Note that
HashSet
gives amortized and average time performance ofO(1)
, not worst case. This means, we can suffer anO(n)
operation from time to time.So, when the bins are too packed up, we just create a new, bigger array, and copy the elements to it.
This costs
n
operations, and is done when number of elements in the set exceeds2n/2=n
, so it means, the average cost of this operation is bounded byn/n=1
, which is a constant.Additionally, the number of collisions a HashMap offers is also constant on average.
Assume you are adding an element
x
. The probability ofh(x)
to be filled up with one element is ~n/2n = 1/2
. The probability of it being filled up with 3 elements, is ~(n/2n)^2 = 1/4
(for large values ofn
), and so on and so on.This gives you an average running time of
1 + 1/2 + 1/4 + 1/8 + ...
. Since this sum converges to2
, it means this operation takes constant time on average.What I know about hashed structures is that to keep a O(1) complexity for insertion removal you need to have a good hash function to avoid collisions and the structure should not be full ( if the structure is full you will have collisions).
Normally hashed structures define a kind of fill limit, by example 70%. When the number of object make the structure be filled more than this limit than you should extend it size to stay below the limit and warranty performances. Generally you double the size of the structure when reaching the limit so that structure size grow faster than number of elements and reduce the number of resize/maintenance operations to perform
This is a kind of maintenance operation that consists on rehashing all elements contained int he structure to redistribute them in the resized structure. For sure this has a cost whose complexity is O(n) with n the number of elements stored in the structure but this cost is not integrated in the add function that will make the maintenance operation needed
I think this is what disturb you.
I learned also that the hash function generally depends on size of the structure that is used as parameter (there was something like max number of elements to reach the limit is a prime number of structure size to reduce the probability of collision or something like that) meaning that you don't change the hash function itself, you just change on of its parameters.
To answer to your comment there is not warranty if bucket 0 or 1 was filled that when you resize to 4 new elements will go inside bucket 3 and 4. Perhaps resizing to 4 make elements A and B now be in buckets 0 and 3
For sure all above is theorical and in real life you don`t have infinite memory, you can have collisions and maintenance has a cost etc so that's why you need to have an idea about the number of objects that you will store and do a trade off with available memory to try to choose an initial size of hashed structure that will limit the need to perform maintenance operations and allow you to stay in the O(1) performances