I would like to calculate a hash of a Python class containing a dataset for machine learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1.
The problem is that most of the data is stored in NumPy arrays; these do not provide a __hash__() member. Currently I do a pickle.dumps() for each member and calculate a hash based on these strings. However, I found the following links indicating that the same object could lead to different serialization strings:

What would be the best method to calculate a hash for a Python class containing NumPy arrays?
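For reference, my current approach looks roughly like this (a sketch; the class and member names are illustrative, not my real code):

```python
import hashlib
import pickle

import numpy as np

class Dataset:
    """Illustrative stand-in; the real class holds several NumPy arrays."""

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def digest(self):
        m = hashlib.md5()
        for member in (self.inputs, self.targets):
            # Serialization is not guaranteed to be stable, which is the problem.
            m.update(pickle.dumps(member))
        return m.hexdigest()

data = Dataset(np.random.rand(100, 10), np.arange(100))
print(data.digest())
```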
Fastest by some margin seems to be hashing the array's raw bytes with Python's built-in hash(), where a is a NumPy ndarray. Obviously this is not secure hashing, but it should be good for caching etc.
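A one-line sketch of that idea (I am assuming a.tobytes() here; older code may spell it a.tostring()):

```python
import numpy as np

a = np.random.rand(10, 100)
key = hash(a.tobytes())   # fast, non-cryptographic cache key
# Note: the built-in hash() is salted per process (PYTHONHASHSEED),
# so this key is not stable across interpreter runs.
```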
Here is how I do it in jug (git HEAD at the time of this answer):
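In outline, the logic looks like this (a sketch of the approach described below, not the verbatim jug source):

```python
import hashlib
import pickle

import numpy as np

def update_hash_with_array(m, e):
    m.update(b'np.ndarray')            # type tag
    m.update(pickle.dumps(e.dtype))    # same bytes with another dtype must differ
    m.update(pickle.dumps(e.shape))    # (2, 3) and (3, 2) must not collide
    try:
        m.update(e.data)               # zero-copy buffer of a contiguous array
    except Exception:                  # non-contiguous arrays have no simple buffer
        m.update(e.copy().data)        # copy() returns a contiguous array

m = hashlib.md5()
update_hash_with_array(m, np.arange(12).reshape(3, 4)[:, ::2])  # non-contiguous
print(m.hexdigest())
```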
The reason is that e.data is only available for some arrays (contiguous arrays). The same goes for a.view(np.uint8), which fails with a non-descriptive TypeError if the array is not contiguous.

Using NumPy 1.10.1 and Python 2.7.6, you can now simply hash NumPy arrays using hashlib if the array is C-contiguous (use numpy.ascontiguousarray() if it is not), e.g.:
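For example (a small sketch; the array contents are arbitrary):

```python
import hashlib

import numpy as np

arr = np.arange(101)   # C-contiguous by construction
h = hashlib.md5()
h.update(arr)          # hashlib reads the array's raw buffer directly
print(h.hexdigest())
```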
There is a package for memoizing functions that take NumPy arrays as inputs: joblib. Found from this question.
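A minimal sketch of that memoizing pattern (the cache directory and the function are placeholders I made up):

```python
import numpy as np
from joblib import Memory

memory = Memory('./joblib_cache', verbose=0)

@memory.cache
def expensive_transform(x):
    return x ** 2            # stand-in for a costly computation

a = np.arange(1_000_000)
expensive_transform(a)       # computed and written to the on-disk cache
expensive_transform(a)       # served from the cache on the second call
```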
Thanks to John Montgomery I think I have found a solution, and it should have less overhead than converting every number in possibly huge arrays to strings:
I can create a byte-view of the arrays and use these to update the hash. And somehow this seems to give the same digest as directly updating using the array:
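Something like this (a sketch; the array contents are arbitrary):

```python
import hashlib

import numpy as np

a = np.random.rand(10, 100)
b = a.view(np.uint8)       # byte-view onto the same underlying buffer

c = hashlib.sha1()
c.update(a)                # update with the array directly
d = hashlib.sha1()
d.update(b)                # update with the byte-view

assert c.hexdigest() == d.hexdigest()   # same bytes, same digest
```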
array.data is always hashable, because it's a buffer object. Easy :) (This is suitable unless you care about the difference between differently-shaped arrays with the exact same data, i.e. unless shape, byte order, and other array 'parameters' must also figure into the hash.)
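One caveat: on Python 3, array.data is a memoryview, and hash() only accepts a memoryview that is read-only and byte-formatted, so a little extra work is needed (a sketch):

```python
import numpy as np

a = np.arange(10)
a.flags.writeable = False     # hash() rejects writable memoryviews
key = hash(a.data.cast('B'))  # only read-only 'B'/'b'/'c' memoryviews are hashable
print(key)
```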