How to compare 2 files fast using .NET?

2019-01-02 19:42发布

Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.

  • Would a checksum comparison such as CRC be faster?
  • Are there any .NET libraries that can generate a checksum for a file?

18条回答
裙下三千臣
2楼-- · 2019-01-02 20:28

If the files are not too big, you can use:

public static byte[] ComputeFileHash(string fileName)
{
    using (var stream = File.OpenRead(fileName))
        return System.Security.Cryptography.MD5.Create().ComputeHash(stream);
}

It will only be feasible to compare hashes if the hashes are useful to store.

(Edited the code to something much cleaner.)

查看更多
长期被迫恋爱
3楼-- · 2019-01-02 20:31

In addition to Reed Copsey's answer:

  • The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.

  • If if the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.

For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.

查看更多
无与为乐者.
4楼-- · 2019-01-02 20:33

Honestly, I think you need to prune your search tree down as much as possible.

Things to check before going byte-by-byte:

  1. Are sizes the same?
  2. Is the last byte in file A different than file B

Also, reading large blocks at a time will be more efficient since drives read sequential bytes more quickly. Going byte-by-byte causes not only far more system calls, but it causes the read head of a traditional hard drive to seek back and forth more often if both files are on the same drive.

Read chunk A and chunk B into a byte buffer, and compare them (do NOT use Array.Equals, see comments). Tune the size of the blocks until you hit what you feel is a good trade off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.

查看更多
无与为乐者.
5楼-- · 2019-01-02 20:33

I think there are applications where "hash" is faster than comparing byte by byte. If you need to compare a file with others or have a thumbnail of a photo that can change. It depends on where and how it is using.

private bool CompareFilesByte(string file1, string file2)
{
    using (var fs1 = new FileStream(file1, FileMode.Open))
    using (var fs2 = new FileStream(file2, FileMode.Open))
    {
        if (fs1.Length != fs2.Length) return false;
        int b1, b2;
        do
        {
            b1 = fs1.ReadByte();
            b2 = fs2.ReadByte();
            if (b1 != b2 || b1 < 0) return false;
        }
        while (b1 >= 0);
    }
    return true;
}

private string HashFile(string file)
{
    using (var fs = new FileStream(file, FileMode.Open))
    using (var reader = new BinaryReader(fs))
    {
        var hash = new SHA512CryptoServiceProvider();
        hash.ComputeHash(reader.ReadBytes((int)file.Length));
        return Convert.ToBase64String(hash.Hash);
    }
}

private bool CompareFilesWithHash(string file1, string file2)
{
    var str1 = HashFile(file1);
    var str2 = HashFile(file2);
    return str1 == str2;
}

Here, you can get what is the fastest.

var sw = new Stopwatch();
sw.Start();
var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));
sw.Reset();
sw.Start();
var compare2 = CompareFilesByte(receiveLogPath, logPath);
sw.Stop();
Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));

Optionally, we can save the hash in a database.

Hope this can help

查看更多
刘海飞了
6楼-- · 2019-01-02 20:34

The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is the fact that you are reading one file at a time, somewhat reducing the seek time for the disk head. That slight gain may however very well be eaten up by the added time of calculating the hash.

Also, a checksum comparison of course only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison would end at the first difference, making it a lot faster.

You should also consider that a hash code comparison only tells you that it's very likely that the files are identical. To be 100% certain you need to do a byte-by-byte comparison.

If the hash code for example is 32 bits, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.

查看更多
情到深处是孤独
7楼-- · 2019-01-02 20:36

My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference against comparing bytes in a byte array.

So it is possible to replace that "Math.Ceiling and iterations" loop in the comment above with the simplest one:

            for (int i = 0; i < count1; i++)
            {
                if (buffer1[i] != buffer2[i])
                    return false;
            }

I guess it has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check arguments and then perform the bit shifting) before you compare and that ends up being the same amount of work as compare 8 bytes in two arrays.

查看更多
登录 后发表回答