Suppose I write a class, but don't define a __hash__
for it. Then __hash__(self)
defaults to id(self)
(self
's memory address), according to the documentation.
However I don't see in the documentation, how this value is being used.
So if my __hash__
was simply return 1
, which would cause the hash of all instances of my class to be the same, they all get bucketed into the same underlying hash bucket (which I assume is implemented in C). However, this does not mean that the return value of __hash__
is being used as the key to bin elements in this underlying hash table.
So really, my question is: what happens to the value returned by __hash__
? is it used as the key directly, or is its hash (or the result of some other computation performed on it) used as the key to the hash table?
In case it matters, I'm on python2.7
EDIT: To clarify, I'm not asking about how hash collisions are handled. In python, this seems to be done with linear chaining. Instead, I'm asking how the return value of __hash__
translates into the memory address (?) of the corresponding bucket.
Since Python's hash tables have a size that is a power-of-two, the lower bits of the hash value determine the location in the hash table (or at least the location of the initial probe).
The sequence of probes into a table size of n is given by:
def gen_probes(hashvalue, n):
'Same sequence of probes used in the current dictionary design'
mask = n - 1
PERTURB_SHIFT = 5
if hashvalue < 0:
hashvalue = -hashvalue
i = hashvalue & mask
yield i
perturb = hashvalue
while True:
i = (5 * i + perturb + 1) & 0xFFFFFFFFFFFFFFFF
yield i & mask
perturb >>= PERTURB_SHIFT
For example, the dictionary:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
is stored as an array of size 8 with each entry in the form (hash, key, value)
:
entries = [['--', '--', '--'],
[-8522787127447073495, 'barry', 'green'],
['--', '--', '--'],
['--', '--', '--'],
['--', '--', '--'],
[-9092791511155847987, 'timmy', 'red'],
['--', '--', '--'],
[-6480567542315338377, 'guido', 'blue']]
The C source code for key insertion in Python's dictionaries can be found here: http://hg.python.org/cpython/file/cd87afe18ff8/Objects/dictobject.c#l550
When an object is stored in a dictionary, the __hash__
is used to determine the original bin that the object is placed in. However, that doesn't mean one object will get confused with another in the dictionary- they still check for object equality. It just means that the dictionary will be a bit slower in hashing that type of object than others.
Of course logically (from the view of code that uses the hash table) the object itself is the key. If you search for key "foo"
in the hash table, no matter what other objects in the hash table have the same hash value as "foo"
, the corresponding value will only be returned if one of the key-value pairs stored in the hash table has key equal to "foo"
.
I don't know exactly what Python does, but a hash table implementation has to account for hash collisions. If the hash table array has N
slots, then if you insert N + 1
values (and the table is not resized first), there must be a collision. Also, as in the case you mentioned where __hash__
always returns 1, or just as a quirk of the hash function implementation, it is possible to have two objects with exactly the same hash code.
There are two major strategies used to deal with hash collisions in a hash table for a single machine in memory (different techniques used for distributed hash tables, etc.):
- Each slot in the array is a list (typically a linked list), and all values that hash to
k
modulo N
are placed into the list at slot k
. So if hash values collide, that isn't a problem because both objects with the same hash value end up in the same list.
- Some kind of probing scheme. Basically, if the object you're inserting has hash value equal to
k
modulo N
, you look at slot k
. If it's full, you apply some formula to the current location (maybe just add 1), and look at the next slot. You follow a regular pattern to choose the next slot, given the original hash value and the number of probes so far, and keep probing until you find an open slot. This is less used, since if you aren't careful about your implementation you can run into clustering problems i.e. have to probe many many times before finding the object.
Wikipedia talks a lot more about hash table implementations here.