I'd like to create a hashlib instance, update() it, then persist its state in some way. Later, I'd like to recreate the object using this state data, and continue to update() it. Finally, I'd like to get the hexdigest() of the total cumulative run of data. State persistence has to survive across multiple runs.
Example:
import hashlib
m = hashlib.sha1()
m.update('one')
m.update('two')
# somehow, persist the state of m here
# later, possibly in another process
# recreate m from the persisted state
m.update('three')
m.update('four')
print m.hexdigest()
# at this point, m.hexdigest() should be equal to hashlib.sha1('onetwothreefour').hexdigest()
EDIT:
I did not find a good way to do this with Python in 2010 and ended up writing a small helper app in C to accomplish this. However, there are some great answers below that were not available or known to me at the time.
I was facing this problem too, and found no existing solution, so I ended up writing a library that does something very similar to what Devesh Saini described: https://github.com/kislyuk/rehash. Example:
You can easily build a wrapper object around the hash object which can transparently persist the data.
The obvious drawback is that it needs to retain the hashed data in full in order to restore the state - so depending on the data size you are dealing with, this may not suit your needs. But it should work fine up to some tens of MB.
Unfortunately, hashlib does not expose the hash algorithms as proper classes; it instead provides factory functions that construct the hash objects, so we can't properly subclass them without loading reserved symbols, a situation I'd rather avoid. That only means you have to build your wrapper class from scratch, which is not much of an overhead in Python anyway.
Here is some sample code that might even fit your needs:
You can access the "data" member itself to get and set the state directly, or you can use Python's pickling functions:
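A minimal sketch of such a wrapper, written for Python 3 (so all input is bytes). The class name PersistentHash and its constructor arguments are made up for illustration; only hashlib.new and pickle come from the standard library. It keeps every byte it was fed in a data member and re-hashes that on restore, which is exactly the drawback mentioned above.

```python
import hashlib
import pickle


class PersistentHash:
    """Hash wrapper that remembers all input (hypothetical helper class)."""

    def __init__(self, algorithm="sha1", data=b""):
        self.algorithm = algorithm
        self.data = b""  # full input so far; this is the persisted state
        self._hash = hashlib.new(algorithm)
        self.update(data)

    def update(self, chunk):
        self.data += chunk
        self._hash.update(chunk)

    def hexdigest(self):
        return self._hash.hexdigest()

    def __getstate__(self):
        # The underlying C hash object cannot be pickled, so store inputs only.
        return {"algorithm": self.algorithm, "data": self.data}

    def __setstate__(self, state):
        # Rebuild the real hash object by re-hashing the retained data.
        self.__init__(state["algorithm"], state["data"])


m = PersistentHash("sha1")
m.update(b"one")
m.update(b"two")
saved = pickle.dumps(m)  # these bytes could be written to a file

m2 = pickle.loads(saved)  # later, possibly in another process
m2.update(b"three")
m2.update(b"four")
print(m2.hexdigest() == hashlib.sha1(b"onetwothreefour").hexdigest())
```

Setting m2.data by hand would not refresh the internal hash object in this sketch, so restoring through the constructor or through pickle is the safer route.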
hashlib.sha1 is a wrapper around a C library, so you won't be able to pickle it.
It would need to implement the __getstate__ and __setstate__ methods for Python to access its internal state. You could use a pure Python implementation of sha1 if it is fast enough for your requirements.
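To see the pickling limitation for yourself, this short check (Python 3, assuming CPython) tries to pickle a hashlib object. The exact error message differs between versions, so it is only printed and not relied on.

```python
import hashlib
import pickle

m = hashlib.sha1()
m.update(b"one")

try:
    pickle.dumps(m)
    pickled_ok = True
except TypeError as exc:
    # CPython's hash objects do not support pickling.
    pickled_ok = False
    print("cannot pickle:", exc)
```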
Hash algorithm for dynamic growing/streaming data?
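As a rough illustration of the pure Python route, here is a sketch of SHA-1 written from the published algorithm, with its internal state exposed so it can be saved and restored. The names PurePythonSha1, get_state and from_state are hypothetical, and JSON is just one assumed storage format. It is far slower than hashlib, so treat it as a starting point and not a vetted implementation.

```python
import hashlib
import json
import struct


def _rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


class PurePythonSha1:
    def __init__(self):
        self.h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
        self.buffer = b""  # bytes that do not yet fill a 64-byte block
        self.length = 0    # total number of bytes consumed so far

    def _compress(self, block):
        w = list(struct.unpack(">16I", block))
        for i in range(16, 80):
            w.append(_rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
        a, b, c, d, e = self.h
        for i in range(80):
            if i < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif i < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif i < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            a, b, c, d, e = (
                (_rotl(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF,
                a, _rotl(b, 30), c, d,
            )
        self.h = [(x + y) & 0xFFFFFFFF for x, y in zip(self.h, (a, b, c, d, e))]

    def update(self, data):
        self.length += len(data)
        data = self.buffer + data
        full = len(data) - len(data) % 64
        for i in range(0, full, 64):
            self._compress(data[i:i + 64])
        self.buffer = data[full:]

    def hexdigest(self):
        # Pad a scratch copy so this object can keep receiving updates.
        tail = PurePythonSha1()
        tail.h = list(self.h)
        padding = b"\x80" + b"\x00" * ((55 - self.length) % 64)
        tail.update(self.buffer + padding + struct.pack(">Q", self.length * 8))
        return "".join("%08x" % x for x in tail.h)

    def get_state(self):
        # Plain, JSON-friendly snapshot of everything needed to resume.
        return {"h": self.h, "buffer": self.buffer.hex(), "length": self.length}

    @classmethod
    def from_state(cls, state):
        obj = cls()
        obj.h = list(state["h"])
        obj.buffer = bytes.fromhex(state["buffer"])
        obj.length = state["length"]
        return obj


m = PurePythonSha1()
m.update(b"one")
m.update(b"two")
saved = json.dumps(m.get_state())  # plain text; could be written to a file

m2 = PurePythonSha1.from_state(json.loads(saved))  # later, in another process
m2.update(b"three")
m2.update(b"four")
resumed_digest = m2.hexdigest()
print(resumed_digest == hashlib.sha1(b"onetwothreefour").hexdigest())
```

Unlike a wrapper that retains all hashed data, the saved state here stays small (five 32-bit words, under 64 buffered bytes and a length), no matter how much data has been hashed.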
You can do it this way using ctypes; no helper app in C is needed:
rehash.py
resumable_SHA-256.py
demo
output
Note: I would like to thank PM2Ring for his wonderful code.