I'm dealing with a large number (30,000) of files, each about 10 MB in size. Some of them (I estimate 2%) are actually duplicated, and I need to keep only one copy of every duplicated pair (or triplet). Can you suggest an efficient way to do that? I'm working on Unix.
There is an existing tool for this: `fdupes`.
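For instance (typical usage; `-r` recurses into subdirectories, and `-d -N` deletes without prompting, keeping the first file of each duplicate set):

```bash
# List all duplicate sets under the current directory:
fdupes -r .

# Delete duplicates, keeping the first file of each set (no prompt):
fdupes -r -d -N .
```

Run the listing form first and inspect the output before letting it delete anything.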
Restoring a solution from an old deleted answer.
You can try this snippet to get all the duplicates first, before removing anything. Find possible duplicate files:
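(The original snippet was lost with the deleted answer; the pipeline below is a reconstruction in the same spirit, assuming GNU find and coreutils. It lists only sizes that occur more than once, checksums just those files, and groups identical checksums together.)

```bash
find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d \
  | xargs -I{} find . -type f -size {}c -print0 \
  | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
```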
Now you can use `cmp` to check that the files are really identical.

Write a script that first compares file sizes, then MD5 checksums (caching them, of course) and, if you're very anxious about losing data, bites the bullet and actually compares the duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be, it can't really be done much more efficiently.
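A minimal sketch of that cascade, assuming bash 4+ and GNU coreutils (the variable names are mine):

```bash
#!/usr/bin/env bash
# Pass 1: bucket files by size; only sizes seen more than once need checksums.
declare -A size_count
declare -a files
while IFS= read -r -d '' f; do
    files+=("$f")
    s=$(stat -c %s "$f")
    size_count[$s]=$(( ${size_count[$s]:-0} + 1 ))
done < <(find . -type f -print0)

# Pass 2: checksum only the candidates, then confirm byte for byte with cmp.
declare -A first_with_sum
for f in "${files[@]}"; do
    s=$(stat -c %s "$f")
    (( size_count[$s] > 1 )) || continue      # unique size: cannot be a duplicate
    sum=$(md5sum "$f" | cut -d' ' -f1)        # one checksum per candidate
    if [[ -n "${first_with_sum[$sum]}" ]]; then
        cmp -s "${first_with_sum[$sum]}" "$f" \
            && echo "duplicate: $f == ${first_with_sum[$sum]}"
    else
        first_with_sum[$sum]=$f
    fi
done
```

Replace the `echo` with `rm -- "$f"` once you trust the output.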
I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.
For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 2 megabytes of memory.
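Here is a minimal sketch of that approach (the function names are mine, and it uses SHA-256 rather than MD5; either works here). It hashes each file in chunks so a 10 MB file never has to sit in memory whole, and it only prints by default:

```python
import hashlib
import os
import sys

def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()

def dedupe(root, dry_run=True):
    seen = set()                      # digests of the files kept so far
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = file_hash(path)
            if digest in seen:
                print("duplicate:", path)
                if not dry_run:       # flip to False to actually delete
                    os.remove(path)
            else:
                seen.add(digest)

if __name__ == "__main__":
    dedupe(sys.argv[1] if len(sys.argv) > 1 else ".")
```

With 300 GB of data to read, the running time will be dominated by disk I/O rather than by the hashing itself.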
Save all the file names in an array. Then traverse the array. In each iteration, compare the file's contents with the other files' contents by using the command `md5sum`. If the MD5 is the same, then remove the file.

For example, if file `b` is a duplicate of file `a`, the `md5sum` will be the same for both files.
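A quick demonstration (hypothetical files; `b` is created as a copy of `a`):

```bash
$ echo hello > a; cp a b
$ md5sum a b
b1946ac92492d2347c6235b4d2611184  a
b1946ac92492d2347c6235b4d2611184  b
```

Because the hashes match, `b` can be removed.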