Remove identical files in UNIX

Posted 2019-02-25 06:54

I'm dealing with a large number of files (about 30,000, each around 10 MB in size). Some of them (I estimate 2%) are actually duplicated, and I need to keep only one copy of every duplicated pair (or triplet). Can you suggest an efficient way to do that? I'm working on Unix.

6 Answers
Juvenile、少年°
#2 · 2019-02-25 07:13

There is an existing tool for this: fdupes

Restoring a solution from an old deleted answer.
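
fdupes can both list and delete duplicate files. A minimal usage sketch (hypothetical path; the flags below are the common ones, but check man fdupes on your system, and be careful with -N since it deletes without asking):

fdupes -r /path/to/files          # list duplicate sets recursively
fdupes -r -d /path/to/files       # interactively pick which copy to keep in each set
fdupes -r -d -N /path/to/files    # keep the first file of each set, delete the rest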

放荡不羁爱自由
#3 · 2019-02-25 07:15

You can try this snippet to list all the duplicates first, before removing anything:

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]} (!($1 in seen)){seen[$1]=$2}'
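
Once you have reviewed that output, a small variation of the same pipeline prints only the redundant copies so they can be removed (a sketch only; GNU xargs assumed, and like the snippet above it assumes filenames without spaces or newlines, since awk splits on whitespace):

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print $2} (!($1 in seen)){seen[$1]=$2}' > dupes.txt
# inspect dupes.txt, then delete for real:
xargs -d '\n' rm -v < dupes.txt
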
做个烂人
#4 · 2019-02-25 07:16

Find possible duplicate files:

find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40

The -w40 option makes uniq compare only the first 40 characters of each line, i.e., the SHA-1 hash itself. Now you can use cmp to confirm that the flagged files really are identical.
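
For example, to confirm two candidates flagged above (hypothetical names):

cmp --silent fileA fileB && echo "identical" || echo "different"

cmp exits with status 0 only when the files match byte for byte, so a pair that passes this check is safe to de-duplicate.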

冷血范
#5 · 2019-02-25 07:22

Write a script that first compares file sizes, then MD5 checksums (caching them, of course), and, if you're very anxious about losing data, bites the bullet and actually compares the duplicate candidates byte for byte. Without additional knowledge about how the files came to be, you can't really do this much more efficiently.
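
A rough shell sketch of that staged approach (GNU coreutils assumed; hypothetical paths; filenames must not contain tabs or newlines; it only prints blank-line-separated groups of same-size, same-MD5 candidates so you can cmp and delete them by hand):

#!/bin/bash
dir=${1:-.}
sizes=$(mktemp)

# 1. Record every file size that occurs more than once.
find "$dir" -type f -printf '%s\n' | sort -n | uniq -d > "$sizes"

# 2. Checksum only the files whose size collides, then group identical MD5s.
#    Each blank-line-separated block is one set of duplicate candidates.
find "$dir" -type f -printf '%s\t%p\n' \
  | awk -F'\t' 'NR==FNR { dup[$1]; next } $1 in dup { print $2 }' "$sizes" - \
  | xargs -r -d '\n' md5sum \
  | sort \
  | uniq -w32 --all-repeated=separate

rm -f "$sizes"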

Fickle 薄情
#6 · 2019-02-25 07:30

I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.

For 30,000 files, at 64 bytes per hash table entry, the set needs only about 2 megabytes of memory.
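
If you would rather stay in the shell than write a Python script, the same set-of-hashes idea can be sketched with an awk associative array standing in for the set (a sketch only: it prints rm commands for review instead of deleting, and assumes filenames without spaces or newlines):

find /path -type f -print0 | xargs -0 md5sum | awk 'seen[$1]++ { print "rm --", $2 }'
# After reviewing the output, pipe it through sh to delete the extra copies.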

趁早两清
#7 · 2019-02-25 07:33

Save all the file names in an array, then traverse the array. In each iteration, compare the current file's MD5 checksum (from md5sum) against the other files' checksums. If two checksums match, remove one of the files.

For example, if file b is a duplicate of file a, md5sum will produce the same checksum for both files.
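
A literal sketch of that nested loop in bash (hypothetical directory; it recomputes md5sum for every pair, so it is O(n²) and much slower than the hash-then-group answers above; try it on a copy of the data first, since it deletes files):

#!/bin/bash
files=( /path/to/files/* )
for (( i = 0; i < ${#files[@]}; i++ )); do
    [[ -e ${files[i]} ]] || continue          # may already have been removed
    a=$(md5sum < "${files[i]}")
    for (( j = i + 1; j < ${#files[@]}; j++ )); do
        [[ -e ${files[j]} ]] || continue
        b=$(md5sum < "${files[j]}")
        if [[ $a == "$b" ]]; then
            echo "removing duplicate: ${files[j]}"
            rm -- "${files[j]}"
        fi
    done
done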
