For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally), but I'm curious if there's a better way.
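For reference, a minimal sketch of the approach described above, assuming the sha1() helper is just a thin wrapper around hashlib (the wrapper body and the example dict are assumptions, not part of the question):
import hashlib

def sha1(value):
    # Assumed implementation of the convenience wrapper mentioned above.
    return hashlib.sha1(value.encode("utf-8")).hexdigest()

my_dict = {"q": "python", "page": "2"}  # example GET arguments
cache_key = sha1(repr(sorted(my_dict.items())))
print(cache_key)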
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below for why this approach might not produce a stable result.
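To make the caveat concrete, here is a small sketch (example values are made up): the result is order-independent, but because string hashing has been randomized per interpreter process since Python 3.3, it is only stable within a single run unless PYTHONHASHSEED is pinned, and nested values are not hashable at all:
my_dict = {"q": "python", "page": "2"}
print(hash(frozenset(my_dict.items())))      # same value for any key order...
# ...but typically a different value on the next interpreter run,
# unless PYTHONHASHSEED is pinned in the environment.

# Nested values break this approach entirely:
# hash(frozenset({"a": {"b": 1}}.items()))   # TypeError: unhashable type: 'dict'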
Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
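A minimal sketch of this approach, with the defensive separators and ensure_ascii arguments and a stable digest taken over the JSON string (the example dict is made up):
import hashlib
import json

d = {"b": {"z": 26, "a": 1}, "a": [1, 2, 3]}
canonical = json.dumps(d, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
print(hashlib.sha1(canonical.encode("utf-8")).hexdigest())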
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that contains
    only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
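A quick usage sketch with a made-up nested dictionary; like anything built on hash(), the result is only stable within one interpreter process:
nested = {"a": 1, "b": {"c": [2, 3], "d": {"e": 4}}}
reordered = {"b": {"d": {"e": 4}, "c": [2, 3]}, "a": 1}
print(make_hash(nested))
print(make_hash(nested) == make_hash(reordered))  # True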
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash(), as regards objects:
class Foo(object): pass

foo = Foo()
print(hash(foo))  # 1209812346789
foo.a = 1
print(hash(foo))  # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass

foo = Foo()
print(make_hash(foo.__dict__))  # 1209812346789
foo.a = 1
print(make_hash(foo.__dict__))  # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print(make_hash(Foo.__dict__))  # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print(type(Foo.__dict__))  # <type 'dict_proxy'>
Here is a mechanism similar to the previous one that will handle classes appropriately:
import copy

DictProxyType = type(object.__dict__)

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:

        make_hash([cls.__dict__, cls.__name__])

    A function can be hashed like so:

        make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2

    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
print(make_hash(func.__code__))
# -7666086133114527897
print(make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539)
print(make_hash([func.__code__, func.__dict__, func.__name__]))
# (-7666086133114527897, 3527539, -509551383349783210)
NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that func.__code__ should be replaced with func.func_code in Python 2.
Here is a clearer solution.
def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())

    if isinstance(o, list):
        return tuple([freeze(v) for v in o])

    return o

def make_hash(o):
    """
    Makes a hash out of anything that contains only list, dict and hashable types,
    including string and numeric types.
    """
    return hash(freeze(o))
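A quick usage sketch (example values are made up); the same caveat about per-process hash() stability applies here as well:
d = {"a": [1, 2, 3], "b": {"c": "x"}}
print(make_hash(d))
print(make_hash(d) == make_hash({"b": {"c": "x"}, "a": [1, 2, 3]}))  # True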
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples, and make_hash_sha256() will then turn the repr() of that structure into a base64-encoded SHA-256 hash.
import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))

    if isinstance(o, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in o.items()))

    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))

    return o

o = dict(x=1, b=2, c=[3, 4, 5], d={6, 7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))

print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, dictionary items come out in an arbitrary, implementation-dependent order.
How about this instead?
import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode('utf-8')  # sha1() needs bytes on Python 3
    ).hexdigest()
    return result
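A quick usage sketch with made-up keys, showing how the ignore arguments can drop volatile items such as a timestamp before hashing:
params = {"q": "python", "page": "2", "timestamp": "1697040000"}
print(dict_hash(params, "timestamp"))
print(dict_hash(params, "timestamp") == dict_hash({"page": "2", "q": "python"}))  # True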
To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I would prefer a quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It will work even for types like datetime that are not JSON serializable.
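A small sketch of the idea with a made-up datetime value; since this still relies on hash() of a string, the result is only stable within a single interpreter process:
from datetime import datetime
from pprint import pformat

dictionary = {"when": datetime(2020, 1, 1), "q": "python"}
h = hash(pformat(dictionary))
print(h)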
You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
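A quick usage sketch with a made-up nested structure; once everything is frozen, the result can be passed straight to hash():
nested = {"a": [1, 2, {"b": {3, 4}}], "c": "text"}
frozen = make_hashable(nested)
print(hash(frozen))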
If you want to support more types, use functools.singledispatch (Python 3.7):
import functools  # needed for the decorator below

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here
You can use the maps library to do this. Specifically, maps.FrozenMap:
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.
The general approach is fine, but you may want to consider the hashing method.
SHA was designed for cryptographic strength (speed too, but strength is more important), which is more than a cache key needs. Therefore, using the built-in hash function is probably a good idea, unless security is somehow key here.