Can we say that a truncated md5 hash is still uniformly distributed?
To avoid misinterpretations: I'm aware the chance of collisions is much greater the moment you start to hack off parts of the md5 result; my use-case is actually interested in deliberate collisions. I'm also aware there are other hash methods that may be better suited to use-cases of a shorter hash (including, in fact, my own), and I'm definitely looking into those.
But I'd also really like to know whether md5's uniform distribution also applies to chunks of it. (Consider it a burning curiosity.)
Since MediaWiki uses it to generate filepaths for images (specifically, it takes the left-most two hex digits of the result as path components, e.g. /4/42/The-image-name-here.png), and they're probably also interested in an at least near-uniform distribution, I imagine the answer is 'yes', but I don't actually know.
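To make the path scheme concrete, here is a minimal Python sketch of how I understand it: the first hex digit of the md5 becomes one directory level, and the first two hex digits become the next. The function name hashed_path is made up for illustration, and this is my reading of the scheme rather than MediaWiki's actual code.

```python
import hashlib


def hashed_path(filename: str) -> str:
    """Build a /x/xy/filename path from the first two hex digits of md5(filename).

    Hypothetical helper: an illustration of the scheme, not MediaWiki's code.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # First level: the first hex digit. Second level: the first two hex digits.
    return f"/{digest[0]}/{digest[:2]}/{filename}"


print(hashed_path("The-image-name-here.png"))
```

With 256 possible second-level directories, files only spread evenly if the first 8 bits of the hash are close to uniform, which is exactly what I'm asking about.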
I wrote a little PHP program to answer this question. It's not very scientific, but it shows the distribution of the first and the last 8 bits of the hash values, using the natural numbers as hash input. After about 40,000,000 hashes the difference between the highest and the lowest counts goes down to 1%, so I'd say the distribution is OK. I hope the code is more precise in explaining what was computed :-) Btw, with a similar program I found that the last 8 bits seem to be distributed slightly better than the first.
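For anyone who wants to try this kind of experiment, here is a rough Python sketch (not the PHP program mentioned above). It hashes the decimal strings "0", "1", "2", ... and counts how often each value of the first and last byte occurs. The names byte_counts and spread, the sample size, and the exact definition of the spread (highest minus lowest count, relative to the expected count per bucket) are my assumptions.

```python
import hashlib
from collections import Counter


def byte_counts(n: int) -> tuple[Counter, Counter]:
    """Count first-byte and last-byte values of md5(str(i)) for i in 0..n-1."""
    first, last = Counter(), Counter()
    for i in range(n):
        digest = hashlib.md5(str(i).encode("ascii")).digest()
        first[digest[0]] += 1    # first 8 bits of the hash
        last[digest[-1]] += 1    # last 8 bits of the hash
    return first, last


def spread(counts: Counter, n: int) -> float:
    """(highest - lowest bucket count) relative to the expected count n / 256."""
    values = [counts.get(b, 0) for b in range(256)]
    return (max(values) - min(values)) / (n / 256)


if __name__ == "__main__":
    n = 200_000  # assumed sample size; far fewer than 40,000,000, so expect a larger spread
    first, last = byte_counts(n)
    print(f"first byte spread: {spread(first, n):.2%}")
    print(f"last byte spread:  {spread(last, n):.2%}")
```

The spread shrinks as n grows simply because random fluctuations average out, so a small difference between the first and last byte at one sample size shouldn't be over-interpreted.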
Yes, not exhibiting any bias is a design requirement for a cryptographic hash. MD5 is broken from a cryptographic point of view; however, the distribution of the results was never in question.
If you still need to be convinced, it's not a huge undertaking to hash a bunch of files, truncate the output and use ent ( http://www.fourmilab.ch/random/ ) to analyze the result.
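A minimal Python sketch of that procedure, assuming you just want a binary file of truncated digests to hand to ent. The function name write_truncated_hashes, the output file name, and the use of counter strings as stand-in inputs (instead of real files) are all assumptions for illustration.

```python
import hashlib
from typing import Iterable


def write_truncated_hashes(out_path: str, inputs: Iterable[bytes], keep_bytes: int = 1) -> int:
    """Write the first keep_bytes bytes of md5(data) for each input; return bytes written."""
    written = 0
    with open(out_path, "wb") as out:
        for data in inputs:
            truncated = hashlib.md5(data).digest()[:keep_bytes]
            out.write(truncated)
            written += len(truncated)
    return written


if __name__ == "__main__":
    # Stand-in inputs: decimal counter strings. Replace with file contents if you prefer.
    inputs = (str(i).encode("ascii") for i in range(1_000_000))
    total = write_truncated_hashes("truncated.bin", inputs, keep_bytes=1)
    print(f"wrote {total} bytes")
```

Running ent on the resulting file (ent truncated.bin) then reports entropy, a chi-square figure and a few other statistics for the truncated output.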