I am reading the source code of pyspark and got really confused by the portable_hash function below.
```python
import sys  # portable_hash relies on sys.maxint, which only exists on Python 2


def portable_hash(x):
    """
    This function returns consistent hash code for builtin types, especially
    for None and tuple with None.

    The algorithm is similar to that one used by CPython 2.7

    >>> portable_hash(None)
    0
    >>> portable_hash((None, 1)) & 0xffffffff
    219750521
    """
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678
        for i in x:
            # hash each element recursively and fold it into the running value
            h ^= portable_hash(i)
            h *= 1000003
            h &= sys.maxint  # keep h within the platform's signed integer range
        h ^= len(x)
        if h == -1:  # -1 is reserved as an error indicator in CPython's C API
            h = -2
        return h
    return hash(x)
```
I can see that it is a recursive function: if the input is a tuple, it recursively loops through every element.
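For example (my own test snippet, not part of the pyspark source), the recursion means nested tuples are handled element by element, while a list simply falls through to the built-in `hash` and raises `TypeError`:

```python
# Assumes portable_hash is defined as quoted above (on Python 2, because of sys.maxint).

nested = (1, (2, (3, None)))
print(portable_hash(nested))  # inner tuples are hashed by recursive calls

# The result is deterministic for the same input within a run.
assert portable_hash(nested) == portable_hash(nested)

# Lists get no special treatment, so they hit hash() and fail as usual.
try:
    portable_hash([1, 2, 3])
except TypeError as exc:
    print("unhashable:", exc)  # unhashable type: 'list'
```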
Here are a few of my questions:
- Is this hash approach a one-to-one mapping?
- This function only takes None, tuples, and hashable values into consideration. I know that list objects are not hashable; did they do that intentionally or not?
- I don't have much experience with hashing. Is this a very classic hashing approach, and if so, is there any resource for me to get a better understanding of it?
NO hashing approach is 1-to-1: they all map M possible inputs into N possible integer results where M is much larger than N.
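To make the pigeonhole point concrete, here are a couple of collisions that are easy to check in CPython (my own examples, not from the pyspark source):

```python
# Equal values must hash equal, so True and 1 collide by design.
print(hash(True) == hash(1))  # True

# CPython reserves -1 as an error value in its C API, so hash() never
# returns -1; -1 and -2 therefore collide, and portable_hash inherits that.
print(hash(-1) == hash(-2))                    # True in CPython
print(portable_hash(-1) == portable_hash(-2))  # True as well
```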
Yes, this function delegates to the built-in `hash` for everything else except tuples and `None`. And it's definitely a deliberate design decision in Python (also respected by this function) to make mutable built-ins such as `list` and `dict` not hashable.

Yes, exclusive-or'ing the hash of each item, with modification of the running total as one goes through, is indeed a very classic approach to hashing containers of hashable items.
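As a rough sketch of that classic pattern (a hypothetical, standalone helper, not pyspark code), you XOR each element's hash into a running value, multiply by a large odd constant so that element order matters, and finally mix in the length:

```python
import sys


def combine_hashes(items):
    """Hash a sequence of hashable items, mimicking the tuple branch above."""
    h = 0x345678
    for item in items:
        h ^= hash(item)   # fold in the element's hash
        h *= 1000003      # large odd multiplier makes element order matter
        h &= sys.maxsize  # sys.maxint on Python 2; keeps the value bounded
    h ^= len(items)       # mix in the length, as the original algorithm does
    if h == -1:           # -1 is reserved as an error value in CPython
        h = -2
    return h


print(combine_hashes((1, 2, 3)) == combine_hashes((1, 2, 3)))  # True: deterministic
print(combine_hashes((1, 2, 3)) == combine_hashes((3, 2, 1)))  # False: order matters
```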
For more study about hashing, I'd start at http://en.wikipedia.org/wiki/Hash_function .