Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.
- Would a checksum comparison such as CRC be faster?
- Are there any .NET libraries that can generate a checksum for a file?
Here are some utility functions that allow you to determine if two files (or two streams) contain identical data.
I have also provided a "fast" version that is multi-threaded: it compares the byte arrays (each buffer filled from what has been read from its file) on different threads, using Tasks.
As expected, it's much faster (around 3x), but it consumes more CPU (because it's multi-threaded) and more memory (because it needs two byte-array buffers per comparison thread).
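A minimal sketch of what such utilities might look like; the names `FileComparer`, `StreamsAreEqual`, and `StreamsAreEqualFast`, the 1 MB buffer size, and the double-buffering scheme are illustrative assumptions, not the answer's exact code:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class FileComparer
{
    private const int BufferSize = 1024 * 1024; // assumed 1 MB per buffer

    // Simple version: read both streams chunk-by-chunk, compare on the calling thread.
    public static bool StreamsAreEqual(Stream s1, Stream s2)
    {
        var buffer1 = new byte[BufferSize];
        var buffer2 = new byte[BufferSize];

        while (true)
        {
            int read1 = ReadFully(s1, buffer1);
            int read2 = ReadFully(s2, buffer2);

            if (read1 != read2) return false;   // different lengths
            if (read1 == 0) return true;        // both streams ended together
            if (!buffer1.AsSpan(0, read1).SequenceEqual(buffer2.AsSpan(0, read2)))
                return false;
        }
    }

    // "Fast" version: compare the previous pair of buffers on a Task while the
    // calling thread reads the next pair (hence two buffer pairs and more memory).
    public static bool StreamsAreEqualFast(Stream s1, Stream s2)
    {
        var current = (one: new byte[BufferSize], two: new byte[BufferSize]);
        var spare = (one: new byte[BufferSize], two: new byte[BufferSize]);
        Task<bool> pending = Task.FromResult(true);

        while (true)
        {
            int read1 = ReadFully(s1, current.one);
            int read2 = ReadFully(s2, current.two);

            if (!pending.Result) return false;  // a previous chunk already differed
            if (read1 != read2) return false;
            if (read1 == 0) return true;

            var (b1, b2) = (current.one, current.two);
            int count = read1;
            pending = Task.Run(() => b1.AsSpan(0, count).SequenceEqual(b2.AsSpan(0, count)));

            // Swap buffer pairs so the next read doesn't overwrite data being compared.
            (current, spare) = (spare, current);
        }
    }

    // Stream.Read may return fewer bytes than requested before end-of-stream,
    // so keep reading until the buffer is full or the stream ends.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```

The buffer-pair swap is what lets the next read overlap with the comparison of the previous chunk without the two racing on the same arrays.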
If you *do* decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the one-line solution is:
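With `path1` and `path2` standing in for the two file paths (the names are assumed here):

```csharp
// Full binary comparison: load both files entirely and compare the byte sequences.
// Requires System.IO and System.Linq.
return File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
```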
Unlike some other posted answers, this works correctly for any kind of file: binary, text, media, executable, etc. But as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line endings, character encoding, media metadata, whitespace, padding, or source-code comments) will always be considered not equal.
This code loads both files into memory entirely, so it should not be used for comparing gigantic files. Aside from that consideration, the full loading isn't really a penalty; in fact, this could be an optimal .NET solution for file sizes expected to be less than 85K, since small allocations in .NET are very cheap and we maximally delegate file performance and optimization to the CLR/BCL.

Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though `SequenceEqual` does in fact give us the "optimization" of abandoning on the first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary to confirm the match.

On the other hand, the above code does not include an eager abort for differently-sized files, which can provide a tangible (possibly measurable) performance difference. This one is tangible because, whereas the file length is available in the `WIN32_FILE_ATTRIBUTE_DATA` structure (which must be fetched first anyway for any file access), continuing on to access the file's contents requires an entirely different fetch which might potentially be avoided. If you're concerned about this, the solution becomes two lines:
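Along these lines (again with assumed `path1`/`path2`):

```csharp
// Check the cheaply available lengths first; only read contents if they match.
return new FileInfo(path1).Length == new FileInfo(path2).Length &&
       File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
```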
You could also extend this to avoid the secondary fetches if the (equivalent) `Length` values are both found to be zero (not shown) and/or to avoid building each `FileInfo` twice (also not shown).

My answer is a derivative of @lars but fixes the bug in the call to `Stream.Read`. I also add some fast-path checking that other answers had, and input validation. In short, this should be the answer:
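A sketch of that shape, assuming the class name `FileUtil`, the method names shown, and an 8 KB buffer; the fast paths and validation follow the description above:

```csharp
using System;
using System.IO;

public static class FileUtil
{
    public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
    {
        // Input validation.
        if (fileInfo1 == null) throw new ArgumentNullException(nameof(fileInfo1));
        if (fileInfo2 == null) throw new ArgumentNullException(nameof(fileInfo2));

        // Fast paths: different lengths can never match; the same file always does.
        if (fileInfo1.Length != fileInfo2.Length)
            return false;
        if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        using (var stream1 = fileInfo1.OpenRead())
        using (var stream2 = fileInfo2.OpenRead())
        {
            return StreamsContentsAreEqual(stream1, stream2);
        }
    }

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 1024 * sizeof(long);
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            // The bug fix: use Stream.Read's return value instead of assuming the
            // buffer was filled. (For FileStream, short reads happen only at
            // end-of-file, so comparing the counts is safe here.)
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2) return false;
            if (count1 == 0) return true;
            if (!BuffersAreEqual(buffer1, buffer2, count1)) return false;
        }
    }

    private static bool BuffersAreEqual(byte[] buffer1, byte[] buffer2, int count) =>
        buffer1.AsSpan(0, count).SequenceEqual(buffer2.AsSpan(0, count));
}
```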
Or if you want to be super-awesome, you can use the async variant:
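A matching async sketch; it reuses `BuffersAreEqual` and belongs in the same class as the synchronous version above:

```csharp
public static async Task<bool> FilesContentsAreEqualAsync(FileInfo fileInfo1, FileInfo fileInfo2)
{
    if (fileInfo1 == null) throw new ArgumentNullException(nameof(fileInfo1));
    if (fileInfo2 == null) throw new ArgumentNullException(nameof(fileInfo2));

    if (fileInfo1.Length != fileInfo2.Length)
        return false;
    if (string.Equals(fileInfo1.FullName, fileInfo2.FullName, StringComparison.OrdinalIgnoreCase))
        return true;

    using (var stream1 = fileInfo1.OpenRead())
    using (var stream2 = fileInfo2.OpenRead())
    {
        return await StreamsContentsAreEqualAsync(stream1, stream2).ConfigureAwait(false);
    }
}

private static async Task<bool> StreamsContentsAreEqualAsync(Stream stream1, Stream stream2)
{
    const int bufferSize = 1024 * sizeof(long);
    var buffer1 = new byte[bufferSize];
    var buffer2 = new byte[bufferSize];

    while (true)
    {
        int count1 = await stream1.ReadAsync(buffer1, 0, bufferSize).ConfigureAwait(false);
        int count2 = await stream2.ReadAsync(buffer2, 0, bufferSize).ConfigureAwait(false);

        if (count1 != count2) return false;
        if (count1 == 0) return true;

        // The span comparison lives in the synchronous BuffersAreEqual helper,
        // since ref structs can't be used directly inside async methods in
        // older C# versions.
        if (!BuffersAreEqual(buffer1, buffer2, count1)) return false;
    }
}
```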
It gets even faster if you don't read in small 8-byte chunks but instead put a loop around it and read a larger chunk. I reduced the average comparison time to about 1/4.
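A sketch of that loop; the 8 KB chunk size is illustrative, and as above it assumes file streams, where short reads occur only at end-of-file:

```csharp
using System;
using System.IO;

static bool ChunkedStreamsAreEqual(Stream s1, Stream s2)
{
    const int chunkSize = 1024 * sizeof(long); // read 8 KB at a time, not 8 bytes
    var b1 = new byte[chunkSize];
    var b2 = new byte[chunkSize];

    while (true)
    {
        int n1 = s1.Read(b1, 0, chunkSize);
        int n2 = s2.Read(b2, 0, chunkSize);

        if (n1 != n2) return false;
        if (n1 == 0) return true;

        // Still compare eight bytes at a time, but within the larger chunk.
        int i = 0;
        for (; i + sizeof(long) <= n1; i += sizeof(long))
            if (BitConverter.ToInt64(b1, i) != BitConverter.ToInt64(b2, i))
                return false;
        for (; i < n1; i++)               // any tail bytes at end-of-file
            if (b1[i] != b2[i])
                return false;
    }
}
```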
A checksum comparison will most likely be slower than a byte-by-byte comparison.
In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.
As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
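The original snippet isn't reproduced here, but a minimal version using the built-in cryptography classes might look like this (the helper name `GetMd5Checksum` is mine):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Compute an MD5 checksum for a file, returned as a hex string.
static string GetMd5Checksum(string path)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        byte[] hash = md5.ComputeHash(stream);
        return BitConverter.ToString(hash).Replace("-", string.Empty);
    }
}
```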
However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see whether a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean you only need to do the disk I/O once, on the new file. This would likely be faster than a byte-by-byte comparison.
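For example, with the `GetMd5Checksum` helper above and a hypothetical stored `knownChecksum`:

```csharp
// knownChecksum was computed once for the existing/base file and stored;
// only the new file has to be read from disk now.
bool isSame = string.Equals(
    knownChecksum,
    GetMd5Checksum(newFilePath),
    StringComparison.OrdinalIgnoreCase);
```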
This I have found works well: first compare the lengths, without reading any data, and then compare the read byte sequences.
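A sketch of that approach, with assumed `path1`/`path2`:

```csharp
using System.IO;
using System.Linq;

static bool FilesAreEqual(string path1, string path2)
{
    // Cheap length check first; no file contents are read for this.
    if (new FileInfo(path1).Length != new FileInfo(path2).Length)
        return false;

    // Lengths match, so compare the actual byte sequences.
    return File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
}
```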