What's the most efficient way to identify a binary file? I would like to extract some kind of signature from a binary file and use it to compare it with others.
The brute-force approach would be to use the whole file as a signature, which would take too long and too much memory. I'm looking for a smarter approach to this problem, and I'm willing to sacrifice a little accuracy (but not too much, ey) for performance.
(while Java code-examples are preferred, language-agnostic answers are encouraged)
Edit: Scanning the whole file to create a hash has the disadvantage that the bigger the file, the longer it takes. Since the hash wouldn't be unique anyway, I was wondering if there was a more efficient approach (ie: a hash from an evenly distributed sampling of bytes).
An approach I found effective for this sort of thing was to calculate two SHA-1 hashes. One for the first block in a file (I arbitrarily picked 512 bytes as a block size) and one for the whole file. I then stored the two hashes along with a file size. When I needed to identify a file I would first compare the file length. If the lengths matched then I would compare the hash of the first block and if that matched I compared the hash of the entire file. The first two tests quickly weeded out a lot of non-matching files.
That's what hashing is for. See MessageDigest.
Note that if your file is too big to be read in memory, that's OK because you can feed chunks of the file to the hash function. MD5 and SHA1 for example can take blocks of 512 bits.
Also, two files with the same hash aren't necessarily identical (it's very rare that they aren't though), but two files that are identical have necessarily the same hash.
The usual answer is to use MD5, but I'd like to suggest that there are too many collisions to use MD5 in modern applications: http://www.mscs.dal.ca/~selinger/md5collision/
SHA-1 replaced MD5 over a decade ago.
NIST recommended in 2005 that SHA-2 should be used in place of SHA-1 by the year 2010, because of work that had been done to demonstrate collisions in reduced variants of SHA-1. (Which is pretty good foresight, since it is now known that it takes 2^51 work to find collisions in what should ideally require 2^80 work to find collisions.)
So please, based on what you're trying to do, and which other programs you may need to interoperate with, select among MD5 (please no), SHA-1 (I'd understand, but we can do better), and SHA-2 (pick me! pick me!).
Are you taking into account to use header identification.
If you can design your files in such way, this would be fast and reliable.
Using one byte you can distinguish 255 file types ;)