Detecting duplicate files

Posted 2019-03-21 11:00

I'd like to detect duplicate files in a directory tree. When two identical files are found, only one of the duplicates will be preserved and the remaining duplicates will be deleted to save disk space.

By duplicates I mean files with identical content that may differ in file name and path.

I was thinking of using hash algorithms for this, but there is a chance that different files have the same hash, so I need some additional mechanism to tell me that the files aren't really the same even though the hashes match, because I don't want to delete two different files.

Which additional fast and reliable mechanism would you use?

7 Answers
(account banned)
Reply #2 · 2019-03-21 11:24

The hash solution is fine - you will just need one of the collision-handling mechanisms for dealing with two elements that hash to the same value (chaining or open addressing).

Just add the elements iteratively - if your implementation detects a dupe, it will not add it to the hash set. You will know an element is a dupe if the size of the set did not change after you tried to add it.

Most likely there is already an implementation of this kind of data structure in your language - for example HashSet in Java and unordered_set in C++.
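For illustration, a minimal C# sketch of this idea (the directory argument and the ComputeHash helper are assumptions, not part of the answer). In C#, HashSet&lt;T&gt;.Add returns false when the element is already present, which is the same signal as the unchanged set size:

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class DupeSetDemo
{
    // Hash of the file contents, used as the set element.
    static string ComputeHash(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(stream));
    }

    static void Main(string[] args)
    {
        var seen = new HashSet<string>();
        foreach (var file in Directory.EnumerateFiles(args[0], "*", SearchOption.AllDirectories))
        {
            int before = seen.Count;
            seen.Add(ComputeHash(file));
            if (seen.Count == before)          // size unchanged => duplicate content
                Console.WriteLine("Duplicate: " + file);
        }
    }
}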

劳资没心,怎么记你
Reply #3 · 2019-03-21 11:28

If you use a hash algorithm like SHA-1, or better yet SHA-256 or higher, it is very unlikely that you will get the same hash value for two different files. SHA is a family of cryptographic hash functions and is used in version control systems like Git, so you can rest assured that it will do the job for you.

But if you still want additional checks in place, you can follow these two steps:

  1. Parse the headers - this is a really tough cookie to crack, since different formats may have different header lengths
  2. Have some sanity checks - compare the file sizes, read a few random file positions and check whether the bytes are the same
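For illustration, a minimal C# sketch of a SHA-256 check combined with a file-size sanity check (the class and method names are assumptions, not part of the answer):

using System;
using System.IO;
using System.Security.Cryptography;

static class FileIdentity
{
    // Returns true only if both the sizes and the SHA-256 digests match.
    public static bool ProbablySame(string pathA, string pathB)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;                       // cheap sanity check first

        using (var sha = SHA256.Create())
        {
            byte[] hashA, hashB;
            using (var a = File.OpenRead(pathA)) hashA = sha.ComputeHash(a);
            using (var b = File.OpenRead(pathB)) hashB = sha.ComputeHash(b);
            return Convert.ToBase64String(hashA) == Convert.ToBase64String(hashB);
        }
    }
}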

我欲成王,谁敢阻挡
Reply #4 · 2019-03-21 11:35

Here is a VBS script that will generate a CSV file listing duplicate files in a folder, based on MD5 checksum and file size.

Set fso = CreateObject("Scripting.FileSystemObject")
Dim dic: Set dic = CreateObject("Scripting.Dictionary")
Dim oMD5:  Set oMD5 = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")
Dim oLog 'As Scripting.TextStream

Set oArgs = WScript.Arguments

If oArgs.Count = 1 Then
    sFolderPath = GetFolderPath()
    Set oLog = fso.CreateTextFile(sFolderPath & "\DuplicateFiles.csv", True)
    oLog.Write "sep=" & vbTab & vbCrLf
    CheckFolder oArgs(0)
    oLog.Close
    Msgbox "Done!"
Else
    Msgbox "Drop Folder"
End If

Sub CheckFolder(sFolderPath)
    Dim sKey
    Dim oFolder 'As Scripting.Folder
    Set oFolder = fso.GetFolder(sFolderPath)

    For Each oFile In oFolder.Files
        'sKey = oFile.Name & " - " & oFile.Size
        sKey = GetMd5(oFile.Path) & " - " & oFile.Size

        If dic.Exists(sKey) = False Then 
            dic.Add sKey, oFile.Path
        Else
            oLog.Write oFile.Path & vbTab & dic(sKey) & vbCrLf
        End If
    Next

    For Each oChildFolder In oFolder.SubFolders
        CheckFolder oChildFolder.Path
    Next
End Sub

Function GetFolderPath()
    Dim oFile 'As Scripting.File
    Set oFile = fso.GetFile(WScript.ScriptFullName)
    GetFolderPath = oFile.ParentFolder
End Function

Function GetMd5(filename)
    Dim oXml, oElement

    oMD5.ComputeHash_2(GetBinaryFile(filename))

    Set oXml = CreateObject("MSXML2.DOMDocument")
    Set oElement = oXml.CreateElement("tmp")
    oElement.DataType = "bin.hex"
    oElement.NodeTypedValue = oMD5.Hash
    GetMd5 = oElement.Text
End Function

Function GetBinaryFile(filename)
    Dim oStream: Set oStream = CreateObject("ADODB.Stream")
    oStream.Type = 1 'adTypeBinary
    oStream.Open
    oStream.LoadFromFile filename
    GetBinaryFile = oStream.Read
    oStream.Close
    Set oStream = Nothing
End Function

Here is a VBS script that will generate a CSV file listing duplicate files in a folder, based on file name and size.

Set fso = CreateObject("Scripting.FileSystemObject")
Dim dic: Set dic = CreateObject("Scripting.Dictionary")
Dim oLog 'As Scripting.TextStream

Set oArgs = WScript.Arguments

If oArgs.Count = 1 Then
    sFolderPath = GetFolderPath()
    Set oLog = fso.CreateTextFile(sFolderPath & "\DuplicateFiles.csv", True)
    oLog.Write "sep=" & vbTab & vbCrLf
    CheckFolder oArgs(0)
    oLog.Close
    Msgbox "Done!"
Else
    Msgbox "Drop Folder"
End If

Sub CheckFolder(sFolderPath)
    Dim sKey
    Dim oFolder 'As Scripting.Folder
    Set oFolder = fso.GetFolder(sFolderPath)

    For Each oFile In oFolder.Files
        sKey = oFile.Name & " - " & oFile.Size

        If dic.Exists(sKey) = False Then 
            dic.Add sKey, oFile.Path
        Else
            oLog.Write oFile.Path & vbTab & dic(sKey) & vbCrLf
        End If
    Next

    For Each oChildFolder In oFolder.SubFolders
        CheckFolder oChildFolder.Path
    Next
End Sub

Function GetFolderPath()
    Dim oFile 'As Scripting.File
    Set oFile = fso.GetFile(WScript.ScriptFullName)
    GetFolderPath = oFile.ParentFolder
End Function
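
Both scripts are used the same way: they expect exactly one argument (If oArgs.Count = 1), so the intended usage is to drag a folder onto the .vbs file or to pass a single folder path on the command line. The resulting DuplicateFiles.csv is written next to the script itself (that is what GetFolderPath returns), with each line listing a duplicate's path and the path of the file it duplicates, separated by a tab.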
Summer. ? 凉城
Reply #5 · 2019-03-21 11:38

Calculating a hash for every file will make your program slow. It's better to check the file size first: all duplicates of a file must have the same size, so apply the hash check only to files that share a size. This will make your program perform faster.

There can be more steps:

  1. Check if the file sizes are equal
  2. If step 1 passes, check whether the first and last ranges of bytes (say 100 bytes) are equal
  3. If step 2 passes, check the file type
  4. If step 3 passes, check the hash last

The more cheap criteria you add up front, the faster it will perform, and you can often avoid the last resort (the hash) this way.
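
For illustration, a rough C# sketch of these four stages (the 100-byte probe follows the suggestion above; the extension check as a stand-in for "file type" and the helper names are assumptions):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class StagedCompare
{
    const int Probe = 100;   // bytes to probe at each end, as in step 2

    public static bool LikelyDuplicates(string pathA, string pathB)
    {
        var a = new FileInfo(pathA);
        var b = new FileInfo(pathB);

        // Step 1: sizes must match.
        if (a.Length != b.Length) return false;

        // Step 2: compare the first and last bytes.
        if (!RangeEquals(pathA, pathB, 0, Probe)) return false;
        long tail = Math.Max(0, a.Length - Probe);
        if (!RangeEquals(pathA, pathB, tail, Probe)) return false;

        // Step 3: compare file types (approximated here by extension).
        if (!string.Equals(a.Extension, b.Extension, StringComparison.OrdinalIgnoreCase))
            return false;

        // Step 4: full hash as the last resort.
        using (var sha = SHA256.Create())
        using (var fa = File.OpenRead(pathA))
        using (var fb = File.OpenRead(pathB))
            return sha.ComputeHash(fa).SequenceEqual(sha.ComputeHash(fb));
    }

    static bool RangeEquals(string pathA, string pathB, long offset, int count)
    {
        var bufA = new byte[count];
        var bufB = new byte[count];
        using (var fa = File.OpenRead(pathA))
        using (var fb = File.OpenRead(pathB))
        {
            fa.Seek(offset, SeekOrigin.Begin);
            fb.Seek(offset, SeekOrigin.Begin);
            int readA = fa.Read(bufA, 0, count);
            int readB = fb.Read(bufB, 0, count);
            return readA == readB && bufA.Take(readA).SequenceEqual(bufB.Take(readB));
        }
    }
}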

冷血范
Reply #6 · 2019-03-21 11:39

It would depend on the files you're comparing.

A) The worst-case scenario is:

  1. You have a lot of files of the same size
  2. The files are very large
  3. The files are very similar, with differences found at a narrow, random location in the file

For example, if you had:

  • 100x 2MB files, all of the same size,
  • each compared with each other,
  • using binary comparison, with
  • 50% of each file read (the probability of finding an unequal byte in the first half of the file)

Then you would have:

  • 10,000 comparisons of
  • 1MB each, which equals
  • a total of 10GB of reading.

However, if you had the same scenario but derived the hashes of the files first, you would:

  • read 200MB of data from disk (typically the slowest component in a computer), distilling it to
  • 1.6KB in memory (using MD5 hashing - 16 bytes per file - security is not important here)
  • and would read 2N*2MB for the final direct binary comparison, where N is the number of duplicates found.

I think this worst-case scenario is not typical though.

B) The typical scenario is:

  1. Files usually differ in size
  2. The files are highly likely to differ near the start of the file - this means a direct binary comparison typically does not involve reading the whole file for the bulk of the differing files of the same size.

For example, if you had:

  • A folder of MP3 files (they don't get too big - maybe no bigger than 5MB)
  • 100 files
  • checking size first
  • at most 3 files the same size (duplicates or not)
  • using binary comparison for files of the same size
  • 99% likely to be found different within the first 1KB

Then you would have:

  • At most 33 cases where the length is the same within a 3-file set
  • Parallel binary reading of 3 files (or more is possible) concurrently in 4K chunks
  • With 0% duplicates found: 33 * 3 * 4K of reading = 396KB of disk reading
  • With 100% duplicates found: 33 * 3 * N, where N is the file size (~5MB) = ~495MB

If you expect 100% duplicates, hashing won't be any more efficient than direct binary comparison. Given that you should expect less than 100% duplicates, hashing would be less efficient than direct binary comparison.

C) Repeated comparison

This is the exception. Building a hash+length+path database for all files will accelerate repeated comparisons, but the benefits are marginal. It requires reading 100% of the files initially and storing the hash database. A new file still needs to be read in full before it can be added to the database, and if it matches an entry it will still require a direct binary comparison as the final step (to rule out a hash collision). Even if most files have different sizes, a new file created in the target folder may match an existing file's size, and the stored hash then lets it be excluded from direct comparison quickly.

To conclude:

  • No additional hashes should be used (the ultimate test - binary comparison - should always be the final test)
  • Binary comparison is often more efficient on the first run, when there are many files of different sizes
  • MP3 comparison works well with a length check followed by binary comparison.
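
For illustration, a C# sketch of the approach this answer favours - group by size first, then binary-compare only within each group, reading in 4K chunks and stopping at the first difference (the folder-walking code and names are assumptions, not from the answer):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class SizeThenBinary
{
    // Group files by length, then binary-compare only within each group.
    public static IEnumerable<(string, string)> FindDuplicates(string root)
    {
        var bySize = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
                              .GroupBy(p => new FileInfo(p).Length)
                              .Where(g => g.Count() > 1);

        foreach (var group in bySize)
        {
            var files = group.ToList();
            for (int i = 0; i < files.Count; i++)
                for (int j = i + 1; j < files.Count; j++)
                    if (StreamsEqual(files[i], files[j]))
                        yield return (files[i], files[j]);
        }
    }

    static bool StreamsEqual(string pathA, string pathB)
    {
        const int chunk = 4096;                              // 4K chunks, as in the example above
        var bufA = new byte[chunk];
        var bufB = new byte[chunk];
        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            while (true)
            {
                int readA = a.Read(bufA, 0, chunk);
                int readB = b.Read(bufB, 0, chunk);
                if (readA != readB) return false;
                if (readA == 0) return true;                 // both streams exhausted, files equal
                if (!bufA.Take(readA).SequenceEqual(bufB.Take(readB)))
                    return false;                            // early exit on first difference
            }
        }
    }
}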
男人必须洒脱
Reply #7 · 2019-03-21 11:45
// Requires: using System.IO; using System.Collections.Generic;
// Result collections that the search fills in.
private readonly List<string> duplicatePathsOne = new List<string>();
private readonly List<string> duplicateNamesOne = new List<string>();
private readonly List<string> duplicatePathsTwo = new List<string>();
private readonly List<string> duplicateNamesTwo = new List<string>();

/// ------------------------------------------------------------------------------------------------------------------------
/// <summary>
/// Writes duplicate files to the List<string> collections above
/// </summary>
private void CompareDirectory(string[] files)
{
    for (int i = 0; i < files.Length; i++)
    {
        FileInfo one = new FileInfo(files[i]); // Here's a spot for a progress bar or something

        for (int i2 = 0; i2 < files.Length; i2++)
        {
            if (i != i2 && !duplicatePathsOne.Contains(files[i2])) // In order to prevent duplicate entries
            {
                FileInfo two = new FileInfo(files[i2]);
                if (FilesAreEqual_OneByte(one, two))
                {
                    duplicatePathsOne.Add(files[i]);
                    duplicateNamesOne.Add(Path.GetFileName(files[i]));
                    duplicatePathsTwo.Add(files[i2]);
                    duplicateNamesTwo.Add(Path.GetFileName(files[i2]));
                }
            }
        }
    }
}


/// ------------------------------------------------------------------------------------------------------------------------
/// <summary>
/// Compares two files byte by byte
/// </summary>
private static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
{
    if (first.Length != second.Length)
        return false;

    using (FileStream fs1 = first.OpenRead())
    using (FileStream fs2 = second.OpenRead())
    {
        // long counter so files larger than 2 GB don't overflow the loop index
        for (long i = 0; i < first.Length; i++)
        {
            if (fs1.ReadByte() != fs2.ReadByte())
                return false;
        }
    }

    return true;
}
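
For illustration, a hypothetical call site (the folder path is a placeholder) that gathers every file under a root folder and hands it to the method above:

string[] files = Directory.GetFiles(@"C:\SomeFolder", "*", SearchOption.AllDirectories);
CompareDirectory(files);

Note that reading one byte at a time through ReadByte works but is slow for large files; the length check performed first keeps most comparisons cheap, and buffered chunk reads would speed up the remaining ones.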