I would like to calculate a hash of a Python class containing a dataset for machine learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1.
The problem is that most of the data is stored in NumPy arrays; these do not provide a __hash__() member. Currently I do a pickle.dumps() for each member and calculate a hash based on these strings. However, I found the following links indicating that the same object can lead to different serialization strings:
- Hash of None varies per machine
- Pickle.dumps not suitable for hashing
What would be the best method to calculate a hash for a Python class containing NumPy arrays?
Thanks to John Montgomery I think I have found a solution, and I think it has less overhead than converting every number in possibly huge arrays to strings: I can create a byte view of the arrays and use it to update the hash. Somehow this gives the same digest as updating with the array directly:
>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print a.dtype, b.dtype # a and b have different data types
float64 uint8
>>> hashlib.sha1(a).hexdigest() # sha1 of the original array
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest() # sha1 of the byte view
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
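For reference, here is a minimal Python 3 sketch of how a whole dataset class could be digested this way; the helper name and the member-walking strategy are mine, not part of the original approach:

import hashlib
import pickle

import numpy

def dataset_hash(obj):
    # Illustrative only: walk the instance members in a stable order,
    # feeding arrays into the hash as raw bytes and pickling the rest
    # (with the pickle caveats noted in the question above).
    m = hashlib.sha1()
    for name in sorted(vars(obj)):
        value = vars(obj)[name]
        m.update(name.encode())
        if isinstance(value, numpy.ndarray):
            # assumes the array is contiguous; see the jug answer below
            m.update(value.view(numpy.uint8))
        else:
            m.update(pickle.dumps(value))
    return m.hexdigest()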
What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert them into a string (via some reproducible means), and then feed that into your hash via update()?
e.g.
import hashlib

m = hashlib.md5()  # or sha1, etc.
for value in array:  # 'array' holds the data
    m.update(str(value))
Don't forget, though, that numpy arrays won't provide __hash__() because they are mutable. So be careful not to modify the arrays after you've calculated the hash (the digest will no longer match).
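To illustrate the point (a quick check, not from the original answer):

>>> import hashlib
>>> import numpy
>>> a = numpy.arange(10)
>>> before = hashlib.sha1(a).hexdigest()
>>> a[0] = 42
>>> hashlib.sha1(a).hexdigest() == before
False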
There is a package, joblib, for memoizing functions that take numpy arrays as inputs. Found via this question.
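A minimal sketch of how that looks in practice (assuming a recent joblib, where Memory takes the cache directory as its first argument; the function and path are illustrative):

import numpy
from joblib import Memory

memory = Memory('/tmp/ml_cache', verbose=0)  # cache directory is arbitrary

@memory.cache
def preprocess(data):
    # joblib hashes the array argument to decide whether a
    # previously cached result can be reused
    return data - data.mean()

result = preprocess(numpy.random.rand(10, 100))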
Here is how I do it in jug (git HEAD at the time of this answer):
import hashlib
import pickle

e = some_array_object  # the array to be hashed
M = hashlib.md5()
M.update('np.ndarray')
M.update(pickle.dumps(e.dtype))
M.update(pickle.dumps(e.shape))
try:
    buffer = e.data       # only available for contiguous arrays
    M.update(buffer)
except Exception:
    M.update(e.copy().data)  # copying lays the data out contiguously
The reason is that e.data is only available for some arrays (contiguous arrays). The same goes for a.view(np.uint8), which fails with a non-descriptive type error if the array is not contiguous.
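A small Python 3 sketch of the two cases (the exact exception raised for the non-contiguous view varies across NumPy versions, so this only shows the working path):

import hashlib
import numpy

a = numpy.arange(100, dtype=numpy.float64).reshape(10, 10)
t = a.T  # transposed view: same data, but not C-contiguous

print(t.flags['C_CONTIGUOUS'])  # False

# The fallback from the recipe above: copying lays the data out
# contiguously, so .data can safely be fed to the hash
print(hashlib.md5(t.copy().data).hexdigest())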
array.data is always hashable because it's a buffer object, so this is easy. :) (Unless you care about the difference between differently-shaped arrays with the exact same data, i.e. unless shape, byte order, and other array 'parameters' must also figure into the hash.)
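A quick illustration of that caveat (Python 3; both arrays consist of the same 32 zero bytes):

>>> import hashlib
>>> import numpy
>>> hashlib.md5(numpy.zeros(4)).hexdigest() == hashlib.md5(numpy.zeros((2, 2))).hexdigest()
True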
Fastest by some margin seems to be hashing the raw bytes:
hash(a.tobytes())
where a is a numpy ndarray. (Note that hash(iter(a)) does not work for this: it hashes the iterator object by identity, so it returns a different value on every call.)
Obviously not secure hashing, but it should be good for caching etc.
Using NumPy 1.10.1 and Python 2.7.6, you can now simply hash numpy arrays using hashlib if the array is C-contiguous (use numpy.ascontiguousarray() if not), e.g.:
>>> import hashlib
>>> import numpy
>>> h = hashlib.md5()
>>> arr = numpy.arange(101)
>>> h.update(arr)
>>> print(h.hexdigest())
e62b430ff0f714181a18ea1a821b0918
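And the same pattern for the non-contiguous case, using the ascontiguousarray() fix mentioned above (the strided view here is just an example):

>>> arr2 = numpy.arange(200)[::2]  # strided view, not C-contiguous
>>> arr2.flags['C_CONTIGUOUS']
False
>>> h2 = hashlib.md5()
>>> h2.update(numpy.ascontiguousarray(arr2))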