I've run into some really strange behavior when trying to calculate the md5 hash of a string. The returned hash is always wrong (and different each time) if I pass a string that was the result of concatenation. The only way I've found to get the real hash is to pass a string that wasn't modified in any way after creation.
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> m = hashlib.md5()
>>> a1 = "stack"
>>> a2 = "overflow"
>>> a3 = a1 + a2
>>> a4 = str(a1 + a2)
>>> m.update("stackoverflow")
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'  # actual hash
>>> m.update(a1 + a2)
>>> m.hexdigest()
'458b7358b9e0c3f561957b96e543c5a8'
>>> m.update(a3)
>>> m.hexdigest()
'65b0e62d4ff2d91e111ecc8f27f0e8f5'
>>> m.update(a4)
>>> m.hexdigest()
'60c3ae3dd9a2095340b2e024194bad3c'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'acd4e14145d34dcb10af785badf8e73e'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'03c06ca09faa26166f1096db02272b11'
>>> a1 + a2 == a1 + a2
True
>>> a1 + a2 == a3
True
>>> a3 == a4
True
Am I missing something?
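What you are missing is that hash.update() doesn't replace the hashed data. You are continually updating the hash object, so you are getting the hash of the concatenated strings. From the hashlib.hash.update() documentation:

"Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b)."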
So you are not getting the hash of a single 'stackoverflow' string, you are getting the hash first of 'stackoverflow', then of 'stackoverflowstackoverflow', then 'stackoverflowstackoverflowstackoverflow' etc., each time appending another 'stackoverflow', creating a longer and longer string. None of those longer strings are equal to the original short string, so their hashes are not likely to be equal either.

Create a new object for new strings instead.
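As a minimal sketch of the fresh-object approach, using the question's a1 and a2: the strings are written as bytes literals so the snippet also runs on Python 3 (an assumption beyond the Python 2.7 session above), and the helper name md5_hex is made up for illustration.

```python
import hashlib

a1 = b"stack"
a2 = b"overflow"

def md5_hex(data):
    """Hash `data` with a brand-new md5 object (hypothetical helper)."""
    m = hashlib.md5()  # fresh object, nothing fed into it yet
    m.update(data)
    return m.hexdigest()

# Each call starts from an empty hash object, so the digest depends only on
# the bytes passed in, not on how those bytes were built.
h_literal = md5_hex(b"stackoverflow")
h_concat = md5_hex(a1 + a2)
h_again = md5_hex(a1 + a2)

print(h_literal == h_concat == h_again)  # True
```

Because no state is shared between calls, hashing a1 + a2 any number of times gives the same digest as hashing the literal.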
You can readily produce your 'wrong' hashes by sending in concatenated data.
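One way to check this, sketched with bytes literals for Python 3 compatibility (the variable names are illustrative, not from the original session): feed 'stackoverflow' to a single hash object several times and compare each digest with the hash of the equivalent repeated string.

```python
import hashlib

word = b"stackoverflow"

running = hashlib.md5()  # one object, reused like `m` in the question
matches = []
for n in range(1, 5):
    running.update(word)  # appends to everything fed so far
    # Hash of the word repeated n times, computed with a fresh object.
    one_shot = hashlib.md5(word * n).hexdigest()
    matches.append(running.hexdigest() == one_shot)

print(matches)  # [True, True, True, True]
```

Calling hexdigest() does not reset the object, which is why every later digest in the question covers all the data passed in so far.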
Note that you can also pass the first string straight into the md5() function instead of calling hash.update() separately.

You normally use the hash.update() method only if you are processing data in chunks (like reading a file line by line or reading blocks of data from a socket), and don't want to have to hold all of that data in memory at once.
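A sketch of both points, with the assumptions flagged: the function name md5_of_stream and the chunk sizes are illustrative, and io.BytesIO stands in for a real file or socket.

```python
import hashlib
import io

def md5_of_stream(stream, chunk_size=8192):
    """Hash a binary stream in chunks without loading it all into memory."""
    m = hashlib.md5()
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        m.update(chunk)
    return m.hexdigest()

data = b"stackoverflow" * 1000

# Initial data can be passed straight to the constructor...
one_shot = hashlib.md5(data).hexdigest()
# ...and chunked updates over the same bytes give the same digest.
chunked = md5_of_stream(io.BytesIO(data), chunk_size=64)

print(one_shot == chunked)  # True
```

The chunked version only ever holds one chunk in memory at a time, which is the case update() is meant for.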