I have a class (let's call it myClass
) that implements both __hash__
and __eq__
. I also have a dict
that maps myClass
objects to some value, computing which takes some time.
Over the course of my program, many (in the order of millions) myClass
objects are instantiated. This is why I use the dict
to keep track of those values.
However, sometimes a new myClass
object might be equivalent to an older one (as defined by the __eq__
method). So rather than compute the value for that object again, I'd rather just lookup the value of older myClass
object in the dict
. To accomplish this, I do if myNewMyClassObj in dict
.
Here's my question:
When I use that in
clause, what gets called, __hash__
or __eq__
? The point of using a dict
is that it's O(1) lookup time. So then __hash__
must be called. But what if __hash__
and __eq__
aren't equivalent methods? In that case, will I get a false positive for if myNewMyClassObj in dict
?
Follow up question:
I want to minimize the number of entries in my dict
, so I would ideally like to keep only one of a set of equivalent myClass
objects in the dict
. So again, it seems that __eq__
needs to be called when computing if myNewClassObj in dict
, which would defile a dict
's O(1) lookup time to an O(n) lookup time
First, __hash__(myNewMyClassObj)
gets called. If no object with the same hash is found in the dictionary, Python assumes myNewMyClassObj
is not in the dictionary. (Note that Python requires that whenever __eq__
evaluates as equal for two objects, their __hash__
must be identical.)
If some objects with the same __hash__
are found in the dictionary, __eq__
gets called on each of them. If __eq__
evaluates as equal for any of them, the myNewMyClassObj in dict_
returns True.
Thus, you just need to make sure both __eq__
and __hash__
are fast.
To your follow up question: yes, dict_
stores only one of a set of equivalent MyClass
objects (as defined by __eq__
). (As does set.)
Note that __eq__
is only called on the objects that had the same hash and got allocated to the same bucket. The number of such objects is usually a very small number (dict
implementation makes sure of that). So you still have (roughly) O(1)
lookup performance.
__hash__
will always be called; __eq__
will be called if the object is indeed in the dictionary, or if another object with the same hash is in the dictionary. The hash value is used to narrow down the choice of possible keys. The keys are grouped into "buckets" by hash value, but for lookup Python still has to check each key in the bucket for equality with the lookup key. See http://wiki.python.org/moin/DictionaryKeys . Look at these examples:
>>> class Foo(object):
... def __init__(self, x):
... self.x = x
...
... def __hash__(self):
... print "Hash"
... return hash(self.x)
...
... def __eq__(self, other):
... print "Eq"
... return self.x == other.x
>>> Foo(1) in d
Hash
Eq
10: True
>>> Foo(2) in d
Hash
Eq
11: True
>>> Foo(3) in d
Hash
Eq
12: True
>>> Foo(4) in d
Hash
13: False
In that example, you can see __hash__
is always called. __eq__
is called once for each lookup when the object is in the dict, because they all have distinct hash values, so one equality check is enough to verify that the object with that hash value is indeed the one being queried. __eq__
is not called in the last case, because none of the objects in the dict have the same hash value as Foo(4)
, so Python doesn't need to continue with the __eq__
.
>>> class Foo(object):
... def __init__(self, x):
... self.x = x
...
... def __hash__(self):
... print "Hash"
... return 1
...
... def __eq__(self, other):
... print "Eq"
... return self.x == other.x
>>> d = {Foo(1): 2, Foo(2): 3, Foo(3): 4}
Hash
Hash
Eq
Hash
Eq
Eq
>>> Foo(1) in d
Hash
Eq
18: True
>>> Foo(2) in d
Hash
Eq
Eq
19: True
>>> Foo(3) in d
Hash
Eq
Eq
Eq
20: True
>>> Foo(4) in d
Hash
Eq
Eq
Eq
21: False
In this version, all objects have the same hash value. In this case __eq__
is always called, sometimes multiple times, because the hash doesn't distinguish between the values, so Python needs to explicitly check equality against all values in the dict until it finds an equal one (or finds that none of them equal the one it's looking for). Sometimes it finds it on the first try (Foo(1) in dict
above), sometimes it has to check all the values.
__hash__ defines the bucket the object is put into, __eq__ gets called only when objects are in the same bucket.