I'm running Python 3.5.1 on Windows. I am attempting to find duplicate source code files in a directory by computing their hash. The problem is that Python seems to think some files are empty. Here is the relevant code snippet:
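    import hashlib

    # `path` comes from the surrounding directory walk (not shown)
    with open(path, 'rb') as afile:
        hasher = hashlib.md5()
        data = afile.read()  # read the whole file as bytes
        hasher.update(data)
    print("len(data): {}, Path: {}, Hash:{}".format(len(data), path, hasher.hexdigest()))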
Here is some example output:
len(data): 0, Path: h:\t\TCPServerSocket.h, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 0, Path: h:\t\TCPSocket.cpp, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 0, Path: h:\t\TCPSocket.h, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 5073, Path: h:\t\ConfigFile.cpp, Hash:6188d6a0e0bc02edf27ce232689beff6
I assure you that these files are not empty, and Python is not throwing any errors during execution. Any ideas?
I think you should compute the hash by calling hashlib.md5 on the files themselves.
Let me know if that continues to suggest the files are empty.
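For what it's worth, here is a minimal sketch of hashing a file's contents directly, reading in chunks so large files don't have to fit in memory. The helper name `md5_of_file` and the `chunk_size` default are made up for illustration; they are not from your code.

    ```python
    import hashlib

    def md5_of_file(path, chunk_size=65536):
        """Return (bytes_read, hex_digest) for the file at `path`.

        Hypothetical helper: the name and chunk size are illustrative only.
        """
        hasher = hashlib.md5()
        total = 0
        with open(path, 'rb') as afile:
            # iter() with a sentinel keeps calling read() until it returns b''
            for chunk in iter(lambda: afile.read(chunk_size), b''):
                hasher.update(chunk)
                total += len(chunk)
        return total, hasher.hexdigest()
    ```

If `bytes_read` comes back as 0 here too, the problem is in what `open` sees on disk, not in how the hash is computed.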
I'll delete this answer if it turns out not to be the case, but it's something you need to check. Put a file-size check directly before the open block.
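A minimal sketch of such a check, assuming `path` is the same variable as in your snippet. `report_size` is a made-up helper name, and the exact output format is only a suggestion:

    ```python
    import os

    def report_size(path):
        """Print and return what the OS reports for `path` before Python opens it.

        Hypothetical helper for debugging; not part of the original code.
        """
        exists = os.path.exists(path)
        # getsize raises OSError for a missing file, so only call it if the path exists
        size = os.path.getsize(path) if exists else None
        print("exists: {}, size on disk: {}, Path: {}".format(exists, size, path))
        return exists, size
    ```

Calling `report_size(path)` right before `with open(path, 'rb') as afile:` shows whether the 0 comes from the file system or from the read.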
Everything in your output is consistent with those files actually being empty (d41d8cd98f00b204e9800998ecf8427e is the MD5 of zero bytes), so maybe they are? My first thought was that you might be zeroing out the files elsewhere, although you would figure that out pretty quickly.