I need to get a base64-encoded MD5 hash of an object, where the object is an image stored as a file, fname.
I've tried this:
def get_md5(fname):
    hash = hashlib.md5()
    with open(fname) as f:
        for chunk in iter(lambda: f.read(4096), ""):
            hash.update(chunk)
    return hash.hexdigest().encode('base64').strip()
However, I don't think this is right because it returns a string with too many characters. My understanding is that it needs to be 24 characters long. I get
NjJiM2RlOWMzOTYxYmM3MDI5Y2Q1NzdjOTQ5YWRlYTQ=
I've tried a few other similar ways as well, for example, one that does not do the chunk loop thing. They all return the same string.
(My later actions that need the base64-encoded MD5 hash fail, and I'm thinking this could be why.)
I was able to make it work by using digest() instead of hexdigest(). Then the last line becomes:
return hash.digest().encode('base64').strip()
The result was then 24 characters long, and it was accepted by Google Cloud Storage transfer, which required a base64-encoded MD5 hash.
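Note that the .encode('base64') string codec only exists in Python 2. On Python 3 the same idea can be sketched with the base64 module. This is a minimal sketch, not a definitive implementation: the name get_md5_b64 and the chunk_size parameter are illustrative choices, not part of the original code.

```python
import base64
import hashlib


def get_md5_b64(fname, chunk_size=4096):
    """Return the base64-encoded MD5 digest of a file as a str."""
    md5 = hashlib.md5()
    # Binary mode matters: the hash must see the raw bytes of the image.
    with open(fname, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    # Encode the 16 raw digest bytes, not the 32-character hex string.
    return base64.b64encode(md5.digest()).decode("ascii")
```

Because an MD5 digest is always 16 bytes, the returned string is always 24 characters long, including the trailing == padding.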
First, base64 encoding makes strings longer. (Example using IPython with Python 3):
In [1]: s = '123456789012345678901234'
In [2]: len(s)
Out[2]: 24
In [3]: import base64
In [4]: e = base64.b64encode(s.encode('utf8'))
In [5]: len(e)
Out[5]: 32
In [6]: e
Out[6]: b'MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0'
With base64 encoding you get 8 bits of output for every 6 bits of input, so every 3 input bytes become 4 output characters.
In [7]: 32/24
Out[7]: 1.3333333333333333
In [8]: 8/6
Out[8]: 1.3333333333333333
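The ratio above can be turned into an exact length formula, since padded base64 always emits 4 characters per group of up to 3 input bytes. The helper name b64_len below is hypothetical and only serves to illustrate the arithmetic; the loop checks it against the standard library.

```python
import base64


def b64_len(n_bytes):
    """Length of padded base64 output for n_bytes of input."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


for n in (16, 24, 32):
    # Compare the formula against what the standard library produces.
    assert b64_len(n) == len(base64.b64encode(b"x" * n))
    print(n, "->", b64_len(n))
```

This prints 16 -> 24, 24 -> 32 and 32 -> 44. The 16-byte case is a raw MD5 digest, and the 32-byte case is its hex string, which matches the 44-character result in the question.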
The base64 alphabet uses 64 (or 2**6) different symbols.
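Generally they include the lowercase and uppercase letters and the digits 0-9, which accounts for 62 symbols. This leaves two extra required symbols and a padding character (=). Often + and / are used as the two extra symbols, but there are variations, especially since / is not allowed in UNIX or MS-Windows filenames. The URL- and filename-safe variant, for example, uses - and _ instead.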
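As a small illustration of the alphabet variations, Python's base64 module offers both the standard and the URL- and filename-safe encoders. The two input bytes here are an arbitrary choice, picked only because their encoding happens to use both extra symbols.

```python
import base64

raw = b"\xfb\xff"  # chosen so that the encoding uses both extra symbols

standard = base64.b64encode(raw)         # standard alphabet, uses + and /
urlsafe = base64.urlsafe_b64encode(raw)  # safe alphabet, uses - and _ instead

print(standard)  # b'+/8='
print(urlsafe)   # b'-_8='
```

Both outputs decode back to the same two bytes, as long as the matching decoder (b64decode or urlsafe_b64decode) is used.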
Second, using a hexadecimal representation doubles the length of a byte string, because each byte is written as two hex digits ranging from 00 to FF. Example (again using IPython and Python 3):
In [1]: import hashlib
In [2]: s = b'this is a simple test'
In [3]: len(hashlib.md5(s).digest())
Out[3]: 16
In [4]: len(hashlib.md5(s).hexdigest())
Out[4]: 32
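If you are going to use base64 encoding anyway, it makes no sense to use hexdigest(). Base64-encoding the 32-character hex string gives the 44-character result from the question, while encoding the 16 raw bytes from digest() gives the expected 24 characters.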
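To make the difference concrete, here is a short sketch (Python 3, reusing the test string from the session above) that base64-encodes both forms of the same hash. The variable names from_raw and from_hex are just illustrative.

```python
import base64
import hashlib

s = b'this is a simple test'
md5 = hashlib.md5(s)

# Encoding the 16 raw digest bytes gives the 24-character form.
from_raw = base64.b64encode(md5.digest())
# Encoding the 32-character hex text gives a 44-character string instead.
from_hex = base64.b64encode(md5.hexdigest().encode('ascii'))

print(len(from_raw), len(from_hex))

# Decoding shows what each string really contains.
assert base64.b64decode(from_raw) == md5.digest()
assert base64.b64decode(from_hex) == md5.hexdigest().encode('ascii')
```

The second string is valid base64, but it decodes to the hex text rather than to the hash bytes, which is presumably why a service expecting a base64-encoded MD5 rejects it.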