Let's say I have a referentially transparent function. It is very easy to memoize it; for example:
def memoize(obj):
memo = {}
@functools.wraps(obj)
def memoizer(*args, **kwargs):
combined_args = args + (kwd_mark,) + tuple(sorted(kwargs.items()))
if combined_args not in memo:
memo[combined_args] = obj(*args, **kwargs)
return cache[combined_args]
return memoizer
@memoize
def my_function(data, alpha, beta):
# ...
Now suppose that the data
argument to my_function
is huge; say, it's a frozenset
with millions of elements. In this case, the cost of memoization is prohibitive: every time, we'd have to calculate hash(data)
as part of the dictionary lookup.
I can make the memo
dictionary an attribute to data
instead of an object inside memoize
decorator. This way I can skip the data
argument entirely when doing the cache lookup since the chance that another huge frozenset
will be the same is negligible. However, this approach ends up polluting an argument passed to my_function
. Worse, if I have two or more large arguments, this won't help at all (I can only attach memo
to one argument).
Is there anything else that can be done?
It turns out that the built-in
__hash__
is not that bad, since it caches its own value after the first calculation. The real performance hit comes from the built-in__eq__
, since it doesn't short-circuit on identical objects, and actually goes through the full comparison every time, making it very costly.One approach I thought of is to subclass the built-in class for all large arguments:
This way, dictionary lookup would be instantaneous. But the equality for the new class will be broken.
A better solution is probably this: only when the dictionary lookup is performed, the large arguments can be wrapped inside a special class that redefines
__eq__
and__hash__
to be return the wrapped object'sid()
. The obvious implementation of the wrapper is a bit annoying, since it requires copying all the standardfrozenset
methods. Perhaps deriving it from the relevantABC
class may make it easier.Well, you can use "hash" there with no fears. A frozenset's hash is not calculated more than once by Python - just when it is created - check the timings:
Don't forget Python's "hash" builtin calls the object's
__hash__
method, which has its return value defined at creation time for built-in hasheable objects. Above you can see that calling a identity lambda function is more than twice slower than calling "hash (a)"So, if all your arguments are hasheable, just add their hash when creating "combined_args" - else, just write its creation so that you use hash for frozenset (and maybe other) types, with a conditional.