I am trying to understand Python's hash function under the hood. I created a custom class whose instances all return the same hash value.
class C(object):
    def __hash__(self):
        return 42
I assumed that only one instance of the above class could be in a set at any time, but in fact a set can hold multiple elements with the same hash.
c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x83e98cc>:'c', <__main__.C object at 0x83e98ec>:'d'}
# note that the dict has 2 elements
I experimented a little more and found that if I override the __eq__ method so that all instances of the class compare equal, then the dict only allows one instance.
class D(C):
    def __eq__(self, other):
        return hash(self) == hash(other)
    __hash__ = C.__hash__  # needed on Python 3, where defining __eq__ sets __hash__ to None
p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x8817acc>: 'q'}
# note that the dict has only 1 element
So I am curious to know how a dict can have multiple elements with the same hash. Thanks!
Note: I edited the question to give an example with a dict (instead of a set) because all the discussion in the answers is about dicts. The same applies to sets; a set can also have multiple elements with the same hash value.
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.
The hash table is laid out like an array of slots (so you get O(1) lookup by index). The figure below is a logical representation of a Python hash table: 0, 1, ..., i, ... on the left are the indices of the slots (they are for illustration only and are not stored with the table).
When a new dict is initialized it starts with 8 slots (see dictobject.h:49).
When adding an entry, we start at some slot i that is based on the hash of the key. CPython uses initial i = hash(key) & mask, where mask = PyDict_MINSIZE - 1 (but that's not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.
If that slot is empty, the entry is added to it (by entry, I mean <hash|key|value>). But what if that slot is occupied? Most likely another entry has the same hash, or a hash that maps to the same slot (hash collision!).
If the slot is occupied, CPython compares the hash and the key (by compare I mean the == comparison, not the is comparison) of the entry in the slot against the key of the current entry to be inserted (dictobject.c:337,344-345). If both match, it treats the entry as already present, updates its value, and moves on to the next entry to be inserted. If either the hash or the key doesn't match, it starts probing, that is, it checks other slots until it finds a matching entry or an empty slot.
There you go! The Python implementation of dict checks both the hash equality of two keys and the normal equality (==) of the keys when inserting items. So in summary, if there are two keys, a and b, and hash(a) == hash(b) but a != b, then both can exist harmoniously in a Python dict. But if hash(a) == hash(b) and a == b, then they cannot both be in the same dict.
Because we have to probe after every hash collision, one side effect of too many hash collisions is that lookups and insertions become very slow (as Duncan points out in the comments).
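As a rough illustration of the insertion rule described above, here is a minimal sketch of an open-addressing table. It is not CPython's code: the ToyDict class and the two key classes are made up for this example, plain linear probing stands in for CPython's real probe sequence, and the table never resizes. It assumes Python 3.

```python
class ToyDict:
    """Hypothetical toy table: one (hash, key, value) entry per slot."""

    def __init__(self, size=8):
        self.slots = [None] * size   # size must be a power of two
        self.mask = size - 1

    def insert(self, key, value):
        h = hash(key)
        i = h & self.mask            # the initial slot depends on the hash
        for _ in range(len(self.slots)):
            entry = self.slots[i]
            if entry is None:
                # Empty slot: store the new entry here.
                self.slots[i] = (h, key, value)
                return
            if entry[0] == h and entry[1] == key:
                # Same hash AND equal key: same entry, just update the value.
                self.slots[i] = (h, entry[1], value)
                return
            # Occupied by a different key: probe the next slot
            # (linear probing is an assumption of this sketch).
            i = (i + 1) & self.mask
        raise RuntimeError("table full (this sketch never resizes)")

    def items(self):
        return [(e[1], e[2]) for e in self.slots if e is not None]


class SameHash:
    def __hash__(self):
        return 42                    # default __eq__ compares identity


class SameHashAndEqual(SameHash):
    def __eq__(self, other):
        return hash(self) == hash(other)
    __hash__ = SameHash.__hash__     # Python 3 drops __hash__ when __eq__ is defined


unequal = ToyDict()
unequal.insert(SameHash(), 'c')
unequal.insert(SameHash(), 'd')
print(len(unequal.items()))          # 2: same hash, but the keys are not equal

equal = ToyDict()
equal.insert(SameHashAndEqual(), 'p')
equal.insert(SameHashAndEqual(), 'q')
print(len(equal.items()))            # 1: same hash and equal keys
```

The only point of the sketch is the pair of checks in insert: an occupied slot is treated as "the same key" only when both the stored hash and the == comparison agree.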
I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"
While this is good to know (for geek points?), I am not sure how it can be used in real life. Unless you are explicitly trying to break something, why would two objects that are not equal have the same hash?
Hash tables in general have to allow for hash collisions! You will get unlucky, and two different things will eventually hash to the same value. Underneath, there is a list of items that share that hash. Usually there is only one item in that list, but in this case every instance gets stacked into the same one. The only way the table knows the items are different is through the equality operator.
When this happens, performance degrades as more colliding items are added, which is why you want your hash function to be as "random" as possible.
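To get a feel for that slowdown, here is a small sketch that counts how often == is called while filling a real dict. The Key class and the count_eq_calls helper are made up for this illustration, and it assumes Python 3 (CPython).

```python
class Key:
    """Hypothetical key that counts calls to __eq__."""
    eq_calls = 0

    def __init__(self, n, constant_hash):
        self.n = n
        self.constant_hash = constant_hash

    def __hash__(self):
        # Either every key collides, or each key gets its own hash.
        return 42 if self.constant_hash else hash(self.n)

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.n == other.n


def count_eq_calls(constant_hash, n=200):
    """Insert n distinct keys into a dict and return the number of == calls."""
    Key.eq_calls = 0
    d = {}
    for i in range(n):
        d[Key(i, constant_hash)] = i
    return Key.eq_calls


collided = count_eq_calls(constant_hash=True)
spread = count_eq_calls(constant_hash=False)
print(collided, spread)
```

With the constant hash, each new key has to be compared against the keys already stored, so the number of comparisons grows roughly quadratically with the number of items; with distinct hashes it stays far smaller.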
Edit: the answer below describes one possible way to deal with hash collisions; it is, however, not how Python does it. Python's wiki referenced below is also incorrect. The best source, given by @Duncan below, is the implementation itself: http://svn.python.org/projects/python/trunk/Objects/dictobject.c I apologize for the mix-up.
It stores a list (or bucket) of elements at the hash then iterates through that list until it finds the actual key in that list. A picture says more than a thousand words:
Here you see John Smith and Sandra Dee both hash to 152. Bucket 152 contains both of them. When looking up Sandra Dee it first finds the list in bucket 152, then loops through that list until Sandra Dee is found and returns 521-6955.
The following is wrong; it's only here for context: On Python's wiki you can find (pseudo?) code for how Python performs the lookup.
There are actually several possible solutions to this problem; check out the Wikipedia article for a nice overview: http://en.wikipedia.org/wiki/Hash_table#Collision_resolution
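For comparison, the bucket approach described above (separate chaining, which, as the edit note says, is not what Python does) could be sketched like this. The ChainedTable class is made up for this example, and John Smith's number is a placeholder; only Sandra Dee's number comes from the description above.

```python
class ChainedTable:
    """Hypothetical separate-chaining table: each bucket is a list of (key, value)."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash only selects the bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for idx, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: replace its value
                bucket[idx] = (key, value)
                return
        bucket.append((key, value))      # new key: stack it into the bucket

    def get(self, key):
        # Walk the bucket's list until == finds the actual key.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)


# A single bucket forces both names to collide, like bucket 152 in the picture.
book = ChainedTable(n_buckets=1)
book.put("John Smith", "521-0000")       # placeholder number
book.put("Sandra Dee", "521-6955")
print(book.get("Sandra Dee"))            # 521-6955
```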
For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?
Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.
If the hash doesn't match or the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending whether we are adding or getting a value).
The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
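A sketch of how such a sequence of candidate slots can be generated is shown below. It is modelled on the recurrence described in the comments of CPython's dictobject.c (multiply by 5, add 1, and mix in a "perturb" value that is shifted right by 5 bits each step); the exact ordering of those steps differs between CPython versions, so treat the function as an approximation. The name probe_sequence and the 64-bit mask are assumptions of this example.

```python
def probe_sequence(h, mask, limit):
    """Yield up to `limit` candidate slots for hash `h` in a table of mask + 1 slots."""
    perturb = h & 0xFFFFFFFFFFFFFFFF    # treat the hash as unsigned (64-bit assumed)
    i = perturb & mask                  # initial slot: low bits of the hash
    for _ in range(limit):
        yield i
        perturb >>= 5                   # pull in higher, so far unused hash bits
        i = (5 * i + perturb + 1) & mask


# Both hashes start in slot 2 of an 8-slot table, then go different ways.
print(list(probe_sequence(42, 7, 3)))   # [2, 4, 5]
print(list(probe_sequence(10, 7, 3)))   # [2, 3, 0]
```

Once perturb has been shifted down to zero, the recurrence reduces to i = (5 * i + 1) & mask, which cycles through every slot of a power-of-two table, so the search is guaranteed to reach an empty slot eventually.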
In the thread I did not see what exactly Python does with instances of user-defined classes when we use them as dictionary keys. Let's read some documentation: it declares that only hashable objects can be used as keys. All immutable built-in types and, by default, all user-defined classes are hashable.
So if your class has a constant __hash__ but does not provide a __cmp__ or __eq__ method, all your instances are unequal as far as the dictionary is concerned. On the other hand, if you provide a __cmp__ or __eq__ method but no __hash__, your instances are still unequal in terms of the dictionary (this holds on Python 2; on Python 3 a class that defines __eq__ without __hash__ becomes unhashable).
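A quick way to check these cases is sketched below. The class names are made up for this example, and it assumes Python 3, where the last case behaves differently from Python 2: a class that defines __eq__ without __hash__ is not hashable at all.

```python
class OnlyHash:
    def __hash__(self):
        return 42                       # default __eq__ compares identity


class HashAndEq:
    def __hash__(self):
        return 42

    def __eq__(self, other):
        return isinstance(other, HashAndEq)


class OnlyEq:
    def __eq__(self, other):
        return isinstance(other, OnlyEq)


only_hash = {OnlyHash(): 1, OnlyHash(): 2}
hash_and_eq = {HashAndEq(): 1, HashAndEq(): 2}

print(len(only_hash))                   # 2: same hash, but the instances are not equal
print(len(hash_and_eq))                 # 1: same hash and equal
print(OnlyEq.__hash__)                  # None: unhashable on Python 3
```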