Strange python's hashlib.md5 behavior, differe

2019-03-02 02:42发布

问题:

I've faced some really strange behavior trying to calculate md5 hash of string. Returned hash is always wrong (and different) if I pass string that was result of concatenation. Only way to get real hash I've found is to pass string that wasn't modified in any way after creation.

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> m = hashlib.md5() 
>>> a1 = "stack"
>>> a2 = "overflow"
>>> a3 = a1 + a2
>>> a4 = str(a1 + a2)
>>> m.update("stackoverflow")
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc' //actuall hash
>>> m.update(a1 + a2)
>>> m.hexdigest()
'458b7358b9e0c3f561957b96e543c5a8'
>>> m.update(a3)
>>> m.hexdigest()
'65b0e62d4ff2d91e111ecc8f27f0e8f5'
>>> m.update(a4)
>>> m.hexdigest()
'60c3ae3dd9a2095340b2e024194bad3c'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'acd4e14145d34dcb10af785badf8e73e'
>>> m.update(a1 + a2)
>>> m.hexdigest()
'03c06ca09faa26166f1096db02272b11'
>>> a1 + a2 == a1 + a2
True
>>> a1 + a2 == a3
True
>>> a3 == a4
True

Am I missing something?

回答1:

What you are missing is that hash.update() doesn't replace the hashed data. You are continually updating the hash object, so you are getting the hash of the concatenated strings. From the hashlib.hash.update() documentation:

Update the hash object with the string arg. Repeated calls are equivalent to a single call with the concatenation of all the arguments: m.update(a); m.update(b) is equivalent to m.update(a+b).

Bold emphasis mine.

So you are not getting the hash of a single 'stackoverflow' string, you are getting the hash first of 'stackoverflow', then of 'stackoverflowstackoverflow', then 'stackoverflowstackoverflowstackoverflow' etc., each time appending another 'stackoverflow' creating a longer and longer string. None of those longer strings are equal to the original short string so their hashes are not likely to be equal either.

Create a new object for new strings, instead:

>>> import hashlib
>>> m = hashlib.md5()
>>> m.update('stack' + 'overflow')
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'
>>> m = hashlib.md5()   # **new** hash object
>>> m.update('stackoverflow')
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'
>>> m = hashlib.md5()     # new object again
>>> m.update('stack')     # add the string in pieces, part 1
>>> m.update('overflow')  # and part 2
>>> m.hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'

You can readily produce your 'wrong' hashes by sending in concatenated data:

>>> m = hashlib.md5()
>>> m.update('stackoverflowstackoverflow')
>>> m.hexdigest()
'458b7358b9e0c3f561957b96e543c5a8'
>>> m = hashlib.md5()
>>> m.update('stackoverflowstackoverflowstackoverflow')
>>> m.hexdigest()
'65b0e62d4ff2d91e111ecc8f27f0e8f5'
>>> m = hashlib.md5()
>>> m.update('stackoverflow' * 4)
>>> m.hexdigest()
'60c3ae3dd9a2095340b2e024194bad3c'

Note that you can also pass in the first string into the md5() function:

>>> hashlib.md5('stackoverflow').hexdigest()
'73868cb1848a216984dca1b6b0ee37bc'

You normally use the hash.update() method only if you are processing data in chunks (like reading a file line by line or reading blocks of data from a socket), and don't want to have to hold all of that data in memory at once.



标签: python hash