I'd like to detect duplicate files in a directory tree. When two identical files are found, only one of them should be preserved and the remaining duplicates deleted to save disk space.
By duplicates I mean files with identical content, which may differ in file name and path.
I was thinking of using a hash algorithm for this, but there is a chance that two different files end up with the same hash, so I need an additional mechanism that tells me the files aren't really the same even though their hashes match, because I don't want to delete two different files.
Which additional fast and reliable mechanism would you use?
This is the typical output of md5sum:
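For example (the file names here are hypothetical, and the digest shown happens to be the MD5 of an empty file), two files with the same content produce the same 32-hex-digit digest:

```
$ md5sum report.txt copy-of-report.txt
d41d8cd98f00b204e9800998ecf8427e  report.txt
d41d8cd98f00b204e9800998ecf8427e  copy-of-report.txt
```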
If you don't have to fear intentionally faked files, the chance for a second, random file to produce the same digest is about 1 in 2^128 (MD5 is a 128-bit hash), i.e. roughly 1 in 3.4 × 10^38.
If you take the file size into account as an additional test, your certainty that both files match increases. You can add more and more checks, but a bitwise comparison will be the last word in such a debate. For practical purposes, md5sum should be enough.
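A minimal sketch of that approach, assuming GNU coreutils (md5sum, sort, uniq, cmp); the directory path and candidate file names are placeholders, and nothing here actually deletes anything:

```sh
# Hash every file in the tree, sort by digest, and print groups of lines
# that share the same first 32 characters (the MD5 digest) -- these are
# the duplicate candidates.
find /path/to/tree -type f -exec md5sum {} + \
  | sort \
  | uniq -w32 --all-repeated=separate

# Before deleting anything, settle each candidate pair with a bitwise
# comparison; cmp exits 0 only if the files are byte-for-byte identical.
cmp --silent candidate1 candidate2 && echo "identical" || echo "different"
```

Keeping the first file of each confirmed group and removing the rest is then a separate step; reviewing an explicit list before deletion is safer than wiring rm directly into the pipeline.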