Is MD5 hashing a file still considered a good enough method to uniquely identify it, given all the breaking of the MD5 algorithm and its security issues? Security is not my primary concern here, but uniquely identifying each file is.
Any thoughts?
Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.
For practical purposes, the hash created might be suitably random, but theoretically there is always a probability of a collision, due to the pigeonhole principle. Having different hashes certainly means that the files are different, but getting the same hash doesn't necessarily mean that the files are identical.
Using a hash function for that purpose, no matter whether security is a concern or not, should therefore only ever be the first step of a check, especially if the hash algorithm is known to produce collisions easily. To reliably find out whether two files with the same hash really are identical, you would have to compare them byte by byte.
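To make that concrete, here is a minimal Python sketch of the hash-first, compare-on-match approach; the file names are hypothetical, and the chunked reads keep memory use constant for large files:

    import hashlib

    def md5_of_file(path, chunk_size=8192):
        # Hash the file in chunks so large files never need to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def files_identical(path_a, path_b, chunk_size=8192):
        # Byte-by-byte comparison, only needed when the hashes already match.
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a, b = fa.read(chunk_size), fb.read(chunk_size)
                if a != b:
                    return False
                if not a:  # both files exhausted at the same point
                    return True

    # Hash first; fall back to a full comparison only on a hash match.
    same = (md5_of_file("a.bin") == md5_of_file("b.bin")
            and files_identical("a.bin", "b.bin"))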
MD5 will be good enough if you have no adversary. However, someone can (purposely) create two distinct files which hash to the same value (that's called a collision), and this may or may not be a problem, depending on your exact situation.
Since knowing whether known MD5 weaknesses apply to a given context is a subtle matter, it is recommended not to use MD5. Using a collision-resistant hash function (SHA-256 or SHA-512) is the safe answer. Also, using MD5 is bad public relations (if you use MD5, be prepared to have to justify yourself; whereas nobody will question your using SHA-256).
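For reference, computing a SHA-256 file digest is a one-liner in most languages. A Python sketch (hashlib.file_digest requires Python 3.11+, and the file name is hypothetical):

    import hashlib

    # hashlib.file_digest streams the file for us (available since Python 3.11).
    with open("example.bin", "rb") as f:
        digest = hashlib.file_digest(f, "sha256").hexdigest()
    print(digest)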
An MD5 can produce collisions. Theoretically, although it is highly unlikely, even a million files in a row could produce the same hash. Don't test your luck: check for MD5 collisions before storing the value.
I personally like to create the MD5 of a random string instead, which avoids the overhead of hashing large files. When a collision is found, I iterate and re-hash with the loop counter appended (see the sketch below).
You may want to read up on the pigeonhole principle.
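A minimal Python sketch of that retry loop, assuming the identifiers already in use are kept in a set (all names here are hypothetical):

    import hashlib
    import os

    def unique_md5_id(existing_ids):
        # Derive the identifier from a random string rather than from the
        # file contents, retrying with a counter suffix if we ever collide.
        seed = os.urandom(16).hex()
        counter = 0
        while True:
            candidate = hashlib.md5(f"{seed}{counter}".encode()).hexdigest()
            if candidate not in existing_ids:
                existing_ids.add(candidate)
                return candidate
            counter += 1  # collision: append the loop counter and re-hash

    ids = set()
    print(unique_md5_id(ids))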
I wouldn't recommend it. If the application works on a multi-user system, there might be a user who has two files with the same MD5 hash (they might be an engineer who plays with such files, or just curious; samples are easily downloadable from http://www2.mat.dtu.dk/people/S.Thomsen/wangmd5/samples.html , and I myself downloaded two of them while writing this answer). Another thing is that some applications might store such duplicates for whatever reason (I'm not sure whether any such applications exist, but the possibility does).
If you are uniquely identifying files generated by your program, I would say it is OK to use MD5. Otherwise, I would recommend any other hash function for which no collisions are known yet.
Personally, I think people reach for raw checksums (pick your method) of objects far too often when what they really want is a unique identifier. Fingerprinting an object wasn't intended for this use, and it is likely to require more thought than using a UUID or a similar identifier mechanism.
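For instance, if all you need is an identifier assigned at creation time, a random UUID sidesteps content hashing entirely; a minimal Python sketch:

    import uuid

    # A version-4 UUID carries 122 random bits, so accidental collisions
    # are negligible for any realistic number of files.
    file_id = str(uuid.uuid4())
    print(file_id)  # e.g. '3f2b8c1e-...'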
MD5 has been broken; you could use SHA-1 instead (it is implemented in most languages).
When hashing short (under a few kilobytes?) strings or files, one can create two MD5 hash keys: one of the actual string, and a second of the reverse of the string concatenated with a short asymmetric string. Example: md5 ( reverse ( string || '1010' ) ). Adding the extra string ensures that even inputs consisting of a series of identical bits generate two different keys. Please understand that even under this scheme there is a theoretical chance of both hash keys being identical for non-identical strings, but the probability seems exceedingly small, something on the order of the square of the single-MD5 collision probability, and the time saving can be considerable as the number of files grows. More elaborate schemes for creating the second string could be considered as well, but I am not sure these would substantially improve the odds.
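A small Python sketch of that two-key scheme (the '1010' suffix comes from the description above; everything else is an illustrative assumption):

    import hashlib

    def double_md5_keys(data: bytes):
        # Key 1: MD5 of the data itself.
        # Key 2: MD5 of the data plus a short asymmetric suffix, reversed,
        # i.e. md5(reverse(string || '1010')) from the description above.
        key1 = hashlib.md5(data).hexdigest()
        key2 = hashlib.md5((data + b"1010")[::-1]).hexdigest()
        return key1, key2

    print(double_md5_keys(b"hello world"))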
To check for collisions, one can test the uniqueness of the MD5 hash keys over all bit vectors in a database. Written as runnable PostgreSQL (assuming a hypothetical table files with a bit_vector column), the idea is to report every MD5 value shared by more than one distinct input:

    SELECT md5(bit_vector)            AS hash_key,
           COUNT(*)                   AS rows_with_hash,
           COUNT(DISTINCT bit_vector) AS distinct_inputs
    FROM files
    GROUP BY md5(bit_vector)
    HAVING COUNT(DISTINCT bit_vector) > 1;
I like to think of MD5 as an indicator of probability when storing a large amount of file data.
If the hashes are equal, I know I have to compare the files byte by byte, but that should only rarely happen for a false reason (a collision); otherwise (the hashes differ) I can be certain we're talking about two different files.
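As a rough Python sketch of that workflow (the directory scan is hypothetical, and files are read whole for brevity; very large files would want chunked hashing instead):

    import filecmp
    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def find_duplicates(directory):
        # First pass: bucket files by MD5 digest.
        by_hash = defaultdict(list)
        for path in Path(directory).rglob("*"):
            if path.is_file():
                digest = hashlib.md5(path.read_bytes()).hexdigest()
                by_hash[digest].append(path)

        # Second pass: confirm suspected duplicates byte by byte, so a
        # hash collision can never produce a false positive.
        duplicates = []
        for paths in by_hash.values():
            for other in paths[1:]:
                if filecmp.cmp(paths[0], other, shallow=False):
                    duplicates.append((paths[0], other))
        return duplicates

    print(find_duplicates("."))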