Is MD5 still good enough to uniquely identify files?

Posted 2019-01-07 04:38

Is hashing a file with MD5 still considered a good enough method to uniquely identify it, given all the breaks of the MD5 algorithm and its security issues? Security is not my primary concern here; uniquely identifying each file is.

Any thoughts?

Tags: hash md5

9 answers
地球回转人心会变
Answer 2 · 2019-01-07 04:44

When hashing short strings or files (up to a few kilobytes?), one can create two MD5 keys: one for the actual string and a second for the reverse of the string concatenated with a short non-palindromic suffix. Example: md5 ( reverse ( string || '1010' ) ). Adding the extra suffix ensures that even files consisting of a run of identical bits generate two different keys. Even under this scheme there is a theoretical chance of both hash keys colliding for non-identical strings, but the probability seems exceedingly small, on the order of the square of the single-MD5 collision probability, and the time saved can be considerable as the number of files grows. More elaborate schemes for creating the second string could be considered as well, but I am not sure they would substantially improve the odds.
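The two-key scheme above can be sketched as follows. This is a minimal illustration, not the answerer's exact code; the `b"1010"` suffix is the example value from the text, and `double_md5` is a name chosen here for clarity:

```python
import hashlib

def double_md5(data: bytes, suffix: bytes = b"1010") -> tuple[str, str]:
    """Return two MD5 keys: one of the data itself, and one of the
    reversed data with a short non-palindromic suffix appended."""
    key1 = hashlib.md5(data).hexdigest()
    key2 = hashlib.md5(data[::-1] + suffix).hexdigest()
    return key1, key2
```

Two files would then be treated as identical only if both keys match, which is where the "square of the single collision probability" estimate comes from.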

To check for collisions, one can test the uniqueness of the MD5 keys over all bit_vectors stored in a table, listing any hash value that is shared by more than one distinct bit_vector:

SELECT md5(bit_vector), COUNT(DISTINCT bit_vector)
FROM db
GROUP BY md5(bit_vector)
HAVING COUNT(DISTINCT bit_vector) > 1;

女痞
Answer 3 · 2019-01-07 04:46

An MD5 hash can produce collisions. It is theoretically possible, although highly unlikely, for a million files in a row to produce the same hash. Don't test your luck: check for MD5 collisions before storing the value.

I personally like to create the MD5 of random strings rather than of the file contents, which avoids the overhead of hashing large files. When a collision is found, I iterate and re-hash with a loop counter appended.

You may want to read up on the pigeonhole principle.
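The re-hash-on-collision idea could look roughly like this. It is a sketch under assumptions: `unique_key` and the `existing` set are names invented here, and the caller is responsible for remembering which bytes each key was minted for:

```python
import hashlib

def unique_key(data: bytes, existing: set[str]) -> str:
    """Derive an MD5 key for data; if it collides with a key already
    stored, append a loop counter and re-hash until the key is unique."""
    key = hashlib.md5(data).hexdigest()
    counter = 0
    while key in existing:
        counter += 1
        key = hashlib.md5(data + str(counter).encode()).hexdigest()
    existing.add(key)
    return key
```

Note that this guarantees distinct keys, not content-derived keys: hashing the same bytes twice yields two different keys, so it only works when collisions are detected by comparing the original data, as the answer suggests.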

家丑人穷心不美
Answer 4 · 2019-01-07 04:49

I like to think of MD5 as an indicator of probability when storing a large number of files.

If the hashes are equal, I know I have to compare the files byte by byte, but that false alarm should only happen a handful of times; otherwise (the hashes differ) I can be certain we're talking about two different files.

The star
Answer 5 · 2019-01-07 04:53

Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.
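To put a number on "vanishingly small": the standard birthday-bound approximation gives the chance that at least two of n random 128-bit hashes collide. The function below is an illustration added here, not part of the original answer:

```python
def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday-bound approximation of the probability that at least
    two of n uniformly random `bits`-bit hashes collide:
    p ~= n * (n - 1) / 2^(bits + 1)."""
    return n * (n - 1) / 2 ** (bits + 1)
```

Even for a billion files, this works out to roughly 10^-21, which is why accidental (as opposed to maliciously crafted) MD5 collisions are not a practical worry.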

我命由我不由天
Answer 6 · 2019-01-07 04:55

Personally, I think people reach for raw checksums (pick your method) of other objects to act as unique identifiers far too often, when what they really want is simply a unique identifier. Fingerprinting an object was never intended for this use, and it is likely to require more thought than using a UUID or a similar integrity mechanism.
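If identity, rather than a content fingerprint, is all that is needed, minting an identifier directly sidesteps hashing and collision handling entirely. A minimal sketch using Python's standard uuid module (`new_file_id` is a name chosen here):

```python
import uuid

def new_file_id() -> str:
    """Mint an identifier that is unique by construction, rather than
    fingerprinting the file's contents with a checksum."""
    return str(uuid.uuid4())
```

The trade-off: a UUID identifies the record, not the bytes, so two copies of the same file get different IDs; a checksum is only the right tool when deduplication by content is the actual goal.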

姐就是有狂的资本
Answer 7 · 2019-01-07 04:58

MD5 has been broken; you could use SHA-1 instead (it is implemented in most languages).
