Detecting duplicate files

Posted 2019-03-21 11:00

I'd like to detect duplicate files in a directory tree. When two identical files are found, only one of them should be preserved; the remaining duplicates should be deleted to save disk space.

By duplicates I mean files with identical content, which may differ in file name and path.

I was thinking about using hash algorithms for this purpose, but there is a chance that different files end up with the same hash. So I need some additional mechanism that tells me two files aren't the same even though their hashes match, because I don't want to delete two different files.

Which additional fast and reliable mechanism would you use?

7 answers
混吃等死
Answer 2 · 2019-03-21 11:47

This is the typical output of md5sum:

0c9990e3d02f33d1ea2d63afb3f17c71

If you don't have to fear intentionally faked files, the chance for a second, random file to match is

1/(decimal(0xffffffffffffffffffffffffffffffff)+1)

i.e. 1/2^128, roughly 2.9 · 10^-39, since an MD5 digest is 128 bits.
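For illustration, that number can be evaluated directly (a Python sketch; the answer itself doesn't depend on it):

# Probability that one random file matches a given MD5 digest.
# MD5 digests are 128 bits, so there are 2**128 possible values.
p = 1 / (0xffffffffffffffffffffffffffffffff + 1)  # denominator == 2**128
print(f"{p:.3e}")  # ~2.939e-39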

If you take the file size into account as an additional test, your certainty that both files match increases. You can add more and more measurements, but a bitwise comparison will be the last word in such a debate. For practical purposes, md5sum should be enough.
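Here is a minimal sketch of that pipeline in Python (the helper names md5_of and find_duplicates are illustrative, not from the answer): group files by size first, hash only the size collisions, and let a bitwise comparison be the final arbiter before anything is deleted.

import hashlib
import os
from collections import defaultdict
from filecmp import cmp

def md5_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so large files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # Pass 1: group by size; files of different sizes can never be equal.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: within each size group, group by MD5 digest.
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)

        # Pass 3: the "last word" -- a bitwise comparison before trusting
        # the hash (shallow=False compares contents, not stat metadata).
        for group in by_hash.values():
            dups = [p for p in group[1:] if cmp(group[0], p, shallow=False)]
            if dups:
                yield [group[0]] + dups

A caller would keep the first path of each yielded group and delete the rest; hashing only within size groups also avoids reading files that can't possibly have a twin.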
