This is a very general computer science question, but one whose answer doesn't seem intuitive from the literature about how these structures work. It's a language-agnostic question, but it relates to how a Set data type works internally.

I have used Sets many times, and they are recommended for storing unique values and accessing them quickly. Supposedly, in Big-O notation, the time complexity is O(1) every time the Set is accessed. How is that possible if the Set may contain thousands of items, even if the items are unique?

In order to find an item in the Set, it would still have to scan each and every item, which is O(n) time complexity. Is there anything I'm missing here?
Thanks in advance for your help! Most thorough answer gets the up-vote!
A `Set` is an example of a more general kind of object collectively known as `HashedCollections`. These use some sort of `HashTable` to actually store and retrieve their elements.

Given any `element`, these tables compute an integer value for it named its `hash`. There are several well-known techniques to define the mapping between elements and their `hash` values. Some are intrinsic, in the sense that the `hash` does not depend on the attributes of the `element`, which might change, and hence the `hash` remains the same along the life of the `element`. Others are extrinsic, in the sense that they may depend on the attributes. However, in the latter case, it is supposed that the particular elements will not be modified while they are referenced from a `HashedCollection` (otherwise the `HashedCollection` must be `rehashed`).
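To make the distinction concrete, here is a tiny Python illustration (Python's built-in `hash` plays the role described above; the exact integers vary between runs and are irrelevant):

```python
# Python's built-in hash() maps an object to an integer; equal objects
# must produce equal hashes. The concrete values printed are examples only.
print(hash("hello"))    # some integer, stable within a single run
print(hash((1, 2, 3)))  # immutable tuples are hashable too

# Mutable objects are dangerous as set elements: their hash could change
# after insertion. Python sidesteps the problem by refusing to hash lists.
try:
    hash([1, 2, 3])
except TypeError as e:
    print(e)            # unhashable type: 'list'
```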
The procedure for storing an `element` works as follows:

1. The `hash` is computed for the `element`.
2. An `index` into the table is computed as the remainder of the `hash` modulo the `length` of the table.
3. If the slot at the `index` so calculated is already taken, some policy is applied to resolve the collision.

Step 1 is supposed to be really fast (e.g., the `hash` has no cryptographic strength).

Step 2 assumes (in most cases) that the length of the table is a prime number (powers of 2 are also used).

Step 3 can be resolved in basically two different ways (see the sketch after this list):

- increase the `index` `j` times until the slot at `index + j` is free, or
- keep a list of all the elements that share the same `index` (the bucket).
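Here is a minimal sketch of those three steps in Python, using the first collision policy (linear probing). All names and numbers here are illustrative assumptions, not how any particular runtime implements its `Set`:

```python
class TinySet:
    """A toy hash table; illustrative only, with no growing policy yet."""

    def __init__(self, length=7):                 # a small prime table length
        self.table = [None] * length

    def add(self, element):
        h = hash(element)                         # step 1: compute the hash
        index = h % len(self.table)               # step 2: hash modulo length
        for j in range(len(self.table)):          # step 3: probe on collision
            slot = (index + j) % len(self.table)  # wrap around the table end
            if self.table[slot] is None:
                self.table[slot] = element        # free slot found: store it
                return
            if self.table[slot] == element:
                return                            # already present: Sets keep uniques
        raise RuntimeError("table full")          # a real Set would grow instead

s = TinySet()
s.add("a"); s.add("b"); s.add("a")                # the duplicate "a" is ignored
```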
In addition, if there aren't enough empty slots (which increases the probability of collisions), the table is enlarged and `rehashed` (because the `modulo` changed).

With sufficient free slots and a fairly random distribution of the indexing mechanism, the probability of finding the desired slot in O(1) is very high. Of course, if too many elements collide, the average complexity is no longer O(1), but this is supposedly mitigated by the growing policy (plus the `rehash`).
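A sketch of what that growth could look like under the same linear-probing assumptions (the new length is an arbitrary choice here; the key point is that every element must be re-inserted, because `hash % length` changes when the length does):

```python
def rehash(table):
    """Allocate a larger table and re-insert every element."""
    new_table = [None] * (2 * len(table) + 1)   # arbitrary growth factor
    for element in table:
        if element is None:
            continue                            # skip empty slots
        index = hash(element) % len(new_table)  # indexes change with the length
        while new_table[index] is not None:     # linear probing again
            index = (index + 1) % len(new_table)
        new_table[index] = element
    return new_table
```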
The retrieval is similar. To check whether an `element` belongs in the collection, its `hash` and `modulo` are computed, and the `element` is compared with the contents of the target slot. If the comparison fails, the search proceeds linearly in the bucket.
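Under the same linear-probing assumptions, a lookup might look like the sketch below; the key point is that an empty slot ends the search immediately, so with few collisions only a handful of slots are ever inspected:

```python
def contains(table, element):
    """Check membership: hash, take the modulo, then probe linearly."""
    index = hash(element) % len(table)
    for j in range(len(table)):                 # at worst one full scan
        slot = table[(index + j) % len(table)]
        if slot is None:
            return False                        # a free slot: never stored
        if slot == element:
            return True                         # found at the probed slot
    return False
```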
The removal of elements is somewhat more difficult when there is no `bucket` and instead the `indexes` are increased, but you get the idea.
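The reason is that simply emptying a slot would cut a probe chain in half: a later lookup would stop at the new hole and miss elements stored beyond it. One common workaround, sketched here with a hypothetical `DELETED` sentinel, is a tombstone that keeps the chain walkable:

```python
DELETED = object()                              # sentinel for removed slots

def remove(table, element):
    """Mark the slot as deleted instead of emptying it, so probe chains
    that pass through this slot keep working."""
    index = hash(element) % len(table)
    for j in range(len(table)):
        slot = (index + j) % len(table)
        if table[slot] is None:
            return                              # never stored: nothing to do
        if table[slot] == element:
            table[slot] = DELETED               # tombstone, not None
            return
```

A lookup then treats the tombstone as an occupied slot and probes past it, which is exactly what the `contains` sketch above already does (the sentinel is neither `None` nor equal to any element).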
If you really want to see all of this at work, go ahead and debug the basic operations of `HashedCollections` in any Smalltalk dialect. Lots of fun guaranteed.