Is MD5 still good enough to uniquely identify files?

Posted 2019-01-07 04:38

Is hashing a file with MD5 still considered a good enough method to uniquely identify it, given all the breaks of the MD5 algorithm and its security issues? Security is not my primary concern here; uniquely identifying each file is.

Any thoughts?

Tags: hash md5

9 answers
地球回转人心会变
Answer 2 · 2019-01-07 04:44

When hashing short strings or files (up to a few kilobytes?), one can create two MD5 keys: one for the actual string and a second for the reverse of the string concatenated with a short non-palindromic suffix. Example: md5 ( reverse ( string || '1010' ) ). Adding the extra suffix ensures that even files consisting of a run of identical bits generate two different keys. Even under this scheme there is a theoretical chance of both hash keys colliding for non-identical strings, but the probability seems exceedingly small, on the order of the square of the single-MD5 collision probability, and the time saved can be considerable as the number of files grows. More elaborate schemes for creating the second string could be considered as well, but I am not sure they would substantially improve the odds.
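The two-key scheme above can be sketched as follows. This is a minimal illustration, not the answerer's exact code; the `b"1010"` suffix is the example value from the text, and `double_md5` is a name chosen here for clarity:

```python
import hashlib

def double_md5(data: bytes, suffix: bytes = b"1010") -> tuple[str, str]:
    """Return two MD5 keys: one of the data itself, and one of the
    reversed data with a short non-palindromic suffix appended."""
    key1 = hashlib.md5(data).hexdigest()
    key2 = hashlib.md5(data[::-1] + suffix).hexdigest()
    return key1, key2
```

Two files would then be treated as identical only if both keys match, which is where the "square of the single collision probability" estimate comes from.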

To check for collisions, one can test the uniqueness of the MD5 keys over all bit_vectors stored in a table, listing any hash value that is shared by more than one distinct bit_vector:

SELECT md5(bit_vector), COUNT(DISTINCT bit_vector)
FROM db
GROUP BY md5(bit_vector)
HAVING COUNT(DISTINCT bit_vector) > 1;

女痞
Answer 3 · 2019-01-07 04:46

An MD5 hash can produce collisions. It is theoretically possible, although highly unlikely, for a million files in a row to produce the same hash. Don't test your luck: check for MD5 collisions before storing the value.

I personally like to create the MD5 of random strings rather than of the file contents, which avoids the overhead of hashing large files. When a collision is found, I iterate and re-hash with a loop counter appended.

You may want to read up on the pigeonhole principle.
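The re-hash-on-collision idea could look roughly like this. It is a sketch under assumptions: `unique_key` and the `existing` set are names invented here, and the caller is responsible for remembering which bytes each key was minted for:

```python
import hashlib

def unique_key(data: bytes, existing: set[str]) -> str:
    """Derive an MD5 key for data; if it collides with a key already
    stored, append a loop counter and re-hash until the key is unique."""
    key = hashlib.md5(data).hexdigest()
    counter = 0
    while key in existing:
        counter += 1
        key = hashlib.md5(data + str(counter).encode()).hexdigest()
    existing.add(key)
    return key
```

Note that this guarantees distinct keys, not content-derived keys: hashing the same bytes twice yields two different keys, so it only works when collisions are detected by comparing the original data, as the answer suggests.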

家丑人穷心不美
Answer 4 · 2019-01-07 04:49

I like to think of MD5 as an indicator of probability when storing a large number of files.

If the hashes are equal, I know I have to compare the files byte by byte, but that false alarm should only happen a handful of times; otherwise (the hashes differ) I can be certain we're talking about two different files.

The star
Answer 5 · 2019-01-07 04:53

Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.
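To put a number on "vanishingly small": the standard birthday-bound approximation gives the chance that at least two of n random 128-bit hashes collide. The function below is an illustration added here, not part of the original answer:

```python
def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday-bound approximation of the probability that at least
    two of n uniformly random `bits`-bit hashes collide:
    p ~= n * (n - 1) / 2^(bits + 1)."""
    return n * (n - 1) / 2 ** (bits + 1)
```

Even for a billion files, this works out to roughly 10^-21, which is why accidental (as opposed to maliciously crafted) MD5 collisions are not a practical worry.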

我命由我不由天
Answer 6 · 2019-01-07 04:55

Personally, I think people reach for raw checksums (pick your method) of other objects to act as unique identifiers far too often, when what they really want is simply a unique identifier. Fingerprinting an object was never intended for this use, and it is likely to require more thought than using a UUID or a similar integrity mechanism.
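If identity, rather than a content fingerprint, is all that is needed, minting an identifier directly sidesteps hashing and collision handling entirely. A minimal sketch using Python's standard uuid module (`new_file_id` is a name chosen here):

```python
import uuid

def new_file_id() -> str:
    """Mint an identifier that is unique by construction, rather than
    fingerprinting the file's contents with a checksum."""
    return str(uuid.uuid4())
```

The trade-off: a UUID identifies the record, not the bytes, so two copies of the same file get different IDs; a checksum is only the right tool when deduplication by content is the actual goal.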

姐就是有狂的资本
Answer 7 · 2019-01-07 04:58

MD5 has been broken; you could use SHA-1 instead (it is implemented in most languages).
