I'd like to detect duplicate files in a directory tree. When two identical files are found, only one of them will be preserved and the remaining duplicates will be deleted to save disk space.
A duplicate means a file with the same content, which may differ in file name and path.
I was thinking of using hash algorithms for this purpose, but there is a chance that two different files have the same hash, so I need some additional mechanism to tell me the files aren't really the same even though the hashes match, because I don't want to delete two different files.
Which additional fast and reliable mechanism would you use?
The hash solution is fine; you just need one of the collision-handling mechanisms for dealing with two elements that hash to the same value (chaining or open addressing).
Just add the elements iteratively: if your implementation detects a dupe, it will not add it to the hash set. You will know an element is a dupe if the size of the set does not change after you try to add it.
Most likely there is already an implementation of this kind of data structure in your language, for example HashSet in Java and unordered_set in C++.
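As a minimal sketch of that idea, adapted to whole files by using a SHA-256 content digest as the set element (in Python rather than Java or C++, purely as an illustration and not part of the answer):

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Return paths whose content digest has already been seen."""
    seen = set()          # the "hash set" of content digests
    duplicates = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:        # set size would not change on add: a dupe
            duplicates.append(path)
        else:
            seen.add(digest)
    return duplicates
```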
If you use a hash algorithm like SHA-1, or better yet SHA-256 or higher, I really doubt you will get the same hash value for two different files. SHA is a family of cryptographic hash functions and is used in version control systems like Git, so you can rest assured that it will do the job for you.
But if you still want additional checks in place, you can follow these two steps (a rough sketch follows the list).
1) Parse the headers - this is a really tough cookie to crack, since different formats may have different header lengths.
2) Have some sanity checks - compare the file sizes, read a few random file positions and check whether the bytes there are the same.
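The answer gives no code for those sanity checks; assuming Python, a sketch of the size check plus random-position probes might look like this:

```python
import os
import random

def probably_identical(path_a, path_b, samples=8, window=4096):
    """Sanity check for two files whose hashes matched: compare sizes,
    then compare a few randomly chosen byte ranges."""
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for _ in range(samples):
            offset = random.randrange(max(size - window, 0) + 1)
            fa.seek(offset)
            fb.seek(offset)
            if fa.read(window) != fb.read(window):
                return False
    return True
```

A passing check still isn't proof of equality; only a full byte-by-byte comparison gives certainty.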
Here is a VBS script that will generate a CSV file showing duplicate files in a folder, based on MD5 file checksum and file size.
Here is a VBS script that will generate a CSV file showing duplicate files in a folder, based on file name and size.
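The VBS scripts themselves are not reproduced here. As a rough equivalent of the checksum-plus-size variant (assumed, in Python rather than VBS), the following writes a CSV listing groups of files that share an MD5 checksum and size:

```python
import csv
import hashlib
import os
from collections import defaultdict

def md5_of(path):
    # Whole-file read keeps the sketch short; a chunked read is
    # preferable for very large files.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def write_duplicates_csv(root, out_csv="duplicates.csv"):
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[(md5_of(path), os.path.getsize(path))].append(path)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["md5", "size", "path"])
        for (digest, size), paths in groups.items():
            if len(paths) > 1:          # only report actual duplicates
                for path in paths:
                    writer.writerow([digest, size, path])
```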
Calculating hashes will make your program run slowly. It's better to also check the file size first: all duplicates of a file must have the same size, so only apply the hash check to files that share a size. That will make your program perform faster.
There can be more steps.
The more criteria you add, the faster it'll perform, and this way you can often avoid the last resort (the hash) altogether.
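As a sketch of that ordering (assumed Python, not from the answer): group by size first, and hash only the files whose size collides with another file's.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def duplicate_groups(root):
    # Step 1: cheap filter - group by file size without reading contents.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Step 2: hash only files whose size collides with at least one other.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[(size, digest)].append(path)

    # Any group with more than one member is a set of duplicate candidates.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```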
It would depend on the files you're comparing.
A) The worst-case scenario is:
For example, if you had:
Then you would have:
However, if you had the same scenario but derived the hashes of the files first, you would:
I think this worst-case scenario is not typical though.
B) The typical-case scenario is:
For example, if you had:
Then you would have:
If you expect 100% multiples, hashing won't be any more efficient than direct binary comparison. Given that you should expect <100% multiples, hashing would be less efficient than direct binary comparison.
C) Repeated comparison
This is the exception. Building a hash+length+path database for all the files will accelerate repeated comparisons, but the benefits would be marginal. It requires reading 100% of the files initially and storing the hash database. A new file will need to be read in full and then added to the database, and if it matches an entry it will still require a direct binary comparison as a final step (to rule out a hash collision). Even if most files are different sizes, when a new file is created in the target folder it may match an existing file's size and can then be quickly excluded from direct comparison via the stored hash.
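The direct binary comparison used as that final step is straightforward; a minimal sketch (assumed Python, not part of the answer) could be:

```python
import os

def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Direct binary comparison: the final step after a hash match,
    to rule out a hash collision before deleting anything."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:          # both files exhausted at the same point
                return True
```

Python's standard library offers the same check as filecmp.cmp(path_a, path_b, shallow=False).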
To conclude: hashing mainly pays off for repeated comparisons, and even then only marginally; for a one-off deduplication pass, a size check followed by direct binary comparison is at least as efficient.