I'm dealing with a large number (30,000) of files, each about 10 MB in size. Some of them (I estimate 2%) are actually duplicated, and I need to keep only one copy of every duplicated pair (or triplet). Can you suggest an efficient way to do that? I'm working on Unix.
There is an existing tool for this: `fdupes`.
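For instance (typical usage; `-r` recurses into subdirectories, and `-d -N` deletes without prompting, keeping the first file of each duplicate set):

```bash
# List all duplicate sets under the current directory:
fdupes -r .

# Delete duplicates, keeping the first file of each set (no prompt):
fdupes -r -d -N .
```

Run the listing form first and inspect the output before letting it delete anything.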
Restoring a solution from an old deleted answer.
You can try this snippet to get all the duplicates first, before removing anything. Find possible duplicate files:
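(The original snippet was lost with the deleted answer; the pipeline below is a reconstruction in the same spirit, assuming GNU find and coreutils. It lists only sizes that occur more than once, checksums just those files, and groups identical checksums together.)

```bash
find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d \
  | xargs -I{} find . -type f -size {}c -print0 \
  | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
```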
Now you can use `cmp` to check that the files are really identical.

Write a script that first compares file sizes, then MD5 checksums (caching them, of course) and, if you're very anxious about losing data, bites the bullet and actually compares the duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be, it can't really be done much more efficiently.
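A minimal sketch of that cascade, assuming bash 4+ and GNU coreutils (the variable names are mine):

```bash
#!/usr/bin/env bash
# Pass 1: bucket files by size; only sizes seen more than once need checksums.
declare -A size_count
declare -a files
while IFS= read -r -d '' f; do
    files+=("$f")
    s=$(stat -c %s "$f")
    size_count[$s]=$(( ${size_count[$s]:-0} + 1 ))
done < <(find . -type f -print0)

# Pass 2: checksum only the candidates, then confirm byte for byte with cmp.
declare -A first_with_sum
for f in "${files[@]}"; do
    s=$(stat -c %s "$f")
    (( size_count[$s] > 1 )) || continue      # unique size: cannot be a duplicate
    sum=$(md5sum "$f" | cut -d' ' -f1)        # one checksum per candidate
    if [[ -n "${first_with_sum[$sum]}" ]]; then
        cmp -s "${first_with_sum[$sum]}" "$f" \
            && echo "duplicate: $f == ${first_with_sum[$sum]}"
    else
        first_with_sum[$sum]=$f
    fi
done
```

Replace the `echo` with `rm -- "$f"` once you trust the output.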
I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.
For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 2 megabytes of memory.
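Here is a minimal sketch of that approach (the function names are mine, and it uses SHA-256 rather than MD5; either works here). It hashes each file in chunks so a 10 MB file never has to sit in memory whole, and it only prints by default:

```python
import hashlib
import os
import sys

def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()

def dedupe(root, dry_run=True):
    seen = set()                      # digests of the files kept so far
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = file_hash(path)
            if digest in seen:
                print("duplicate:", path)
                if not dry_run:       # flip to False to actually delete
                    os.remove(path)
            else:
                seen.add(digest)

if __name__ == "__main__":
    dedupe(sys.argv[1] if len(sys.argv) > 1 else ".")
```

With 300 GB of data to read, the running time will be dominated by disk I/O rather than by the hashing itself.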
Save all the file names in an array. Then traverse the array. In each iteration, compare the file's contents with the other files' contents by using the command `md5sum`. If the MD5 is the same, then remove the file.

For example, if file `b` is a duplicate of file `a`, the `md5sum` will be the same for both files.
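A quick demonstration (hypothetical files; `b` is created as a copy of `a`):

```bash
$ echo hello > a; cp a b
$ md5sum a b
b1946ac92492d2347c6235b4d2611184  a
b1946ac92492d2347c6235b4d2611184  b
```

Because the hashes match, `b` can be removed.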