I'm using this code to calculate the hash value of a file:

    import hashlib

    m = hashlib.md5()
    with open("calculator.pdf", 'rb') as fh:
        # Read in 8 KB chunks so large files are not loaded into memory at once
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
    hash_value = m.hexdigest()
    print(hash_value)

When I tried it on a folder named "folder", I got:
    IOError: [Errno 13] Permission denied: folder
How can I calculate the hash value for a folder?
I keep seeing this code propagated through various forums.
The ActiveState recipe answer works but, as Antonio pointed out, it is not guaranteed to be repeatable across filesystems, due to not being able to present the files in the same order (try it). One fix is to change
to
(Yes, I'm being lazy here. This sorts the filenames only and not the directories. The same principle applies.)
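To make the ordering point concrete, here is a minimal sketch of an os.walk-based hasher that sorts both the file names and the directory names before reading. It is not the ActiveState recipe itself; the function name `md5_of_directory` and the `chunk_size` parameter are hypothetical, and the sketch hashes file contents only, not names.

```python
import hashlib
import os


def md5_of_directory(directory, chunk_size=8192):
    """Return one MD5 hex digest covering every file under `directory`."""
    digest = hashlib.md5()
    for root, dirs, files in os.walk(directory):
        # Sorting `dirs` in place makes os.walk descend in a fixed order.
        dirs.sort()
        # Sorting the file names fixes the order within each directory.
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                # Read in chunks so large files are never held in memory whole.
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Because os.walk is top-down by default, modifying `dirs` in place controls the order in which subdirectories are visited, so the same tree should produce the same digest regardless of how the filesystem lists its entries.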
I'm not a fan of how the recipe referenced in the answer was written. I have a much simpler version that I'm using:
I found exceptions were usually being thrown whenever something like an alias was encountered (it shows up in the os.walk() results, but you can't directly open it). The os.path.isfile() check takes care of those issues.
If there were to be an actual file within a directory I'm attempting to hash and it couldn't be opened, skipping that file and continuing is not a good solution. That affects the outcome of the hash. Better to kill the hash attempt altogether. Here, the try statement would be wrapped around the call to my hash_directory() function.
I have optimized further on Andy's response.
The following is a Python 3 rather than Python 2 implementation. It uses SHA1, handles some cases where encoding is needed, is linted, and includes some docstrings.
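A rough sketch of such a variant, not the poster's actual code, might look like this. It combines the os.path.isfile() check and the fail-fast behaviour discussed above. The name `try_hash_directory`, the chunk size, and the choice to fold each file's relative path into the digest are assumptions made for illustration.

```python
import hashlib
import os


def _raise(error):
    """Re-raise errors that os.walk() would otherwise silently ignore."""
    raise error


def hash_directory(path: str, chunk_size: int = 65536) -> str:
    """Return a SHA1 hex digest of the files under `path`.

    Entries that are not regular files are skipped. A directory or regular
    file that cannot be read raises OSError, so the whole attempt is aborted
    rather than producing a digest that silently omits data.
    """
    digest = hashlib.sha1()
    for root, dirs, files in os.walk(path, onerror=_raise):
        dirs.sort()  # fixed traversal order
        for name in sorted(files):
            file_path = os.path.join(root, name)
            if not os.path.isfile(file_path):
                continue  # e.g. a broken symlink listed by os.walk()
            # Assumption: include the relative path, encoded to bytes,
            # so that renaming a file changes the digest.
            rel = os.path.relpath(file_path, path)
            digest.update(rel.encode("utf-8", errors="surrogateescape"))
            with open(file_path, "rb") as fh:  # OSError propagates
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def try_hash_directory(path: str):
    """Wrap the call in try/except; return None if hashing had to be aborted."""
    try:
        return hash_directory(path)
    except OSError as exc:
        print(f"hashing aborted: {exc}")
        return None
```

Keeping the try statement around the call, rather than around each open(), means a single unreadable file stops the run instead of quietly changing the result.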
This Recipe provides a nice function to do what you are asking. I've modified it to use the MD5 hash instead of SHA1, as your original question asks.
You can use it like this:
The output looks like this, as it hashes each file:
The returned value from this function call comes back as the hash. In this case,
5be45c5a67810b53146eaddcae08a809
Here is an implementation that uses pathlib.Path instead of relying on os.walk. It sorts the directory contents before iterating so it should be repeatable on multiple platforms. It also updates the hash with the names of files/directories, so adding empty files and directories will change the hash.
Version with type annotations (Python 3.6 or above):
Without type annotations:
Condensed version if you only need to hash directories:
Usage:
md5_hash = md5_dir("/some/directory")
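As a rough sketch of the pathlib approach described above (sorted directory contents, with file and directory names folded into the hash), something like the following could work. The name `md5_dir` is taken from the usage line; the helper names `_update_from_file` and `_update_from_dir` and the 8192-byte chunk size are hypothetical.

```python
import hashlib
from pathlib import Path


def _update_from_file(path: Path, digest) -> None:
    """Feed the contents of one file into `digest`, chunk by chunk."""
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)


def _update_from_dir(directory: Path, digest) -> None:
    """Feed names and contents of everything under `directory` into `digest`."""
    # Sort entries by name so the order does not depend on the filesystem.
    for entry in sorted(directory.iterdir(), key=lambda p: p.name):
        # Hashing the name means empty files and directories still count.
        digest.update(entry.name.encode())
        if entry.is_file():
            _update_from_file(entry, digest)
        elif entry.is_dir():
            _update_from_dir(entry, digest)


def md5_dir(directory) -> str:
    """Return the MD5 hex digest of a directory tree."""
    digest = hashlib.md5()
    _update_from_dir(Path(directory), digest)
    return digest.hexdigest()
```

Since only entry names (not full paths) are hashed, two directories with identical contents in different locations should yield the same digest under this sketch.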
Use the checksumdir Python package to calculate the checksum/hash of a directory. It's available at https://pypi.python.org/pypi/checksumdir/1.0.5
Usage: