Many iterations on a hash: doesn't it reduce entropy?

Posted 2019-03-28 12:15

Question:

I see this technique recommended in many places (including Stack Overflow), and I can't get it out of my head that this would reduce entropy! After all, you are hashing something again that has already been hashed and already has a collision chance. Wouldn't collision chance on top of collision chance result in more collision chances? After researching, it seems I'm wrong, but why?

Answer 1:

Since you tagged md5, I'll use that as an example. From Wikipedia:

if two prefixes with the same hash can be constructed, a common suffix can be added to both to make the collision more likely to be accepted as valid data by the application using it. Furthermore, current collision-finding techniques allow to specify an arbitrary prefix: an attacker can create two colliding files that both begin with the same content. All the attacker needs to generate two colliding files is a template file with a 128-byte block of data, aligned on a 64-byte boundary that can be changed freely by the collision-finding algorithm. An example MD5 collision, with the two messages differing in 6 bits, is:

And then the example plaintexts they give are 256 bytes long. Since the collision attack relies on a 128-byte block of data, and the hash digest is only 128 bits, there really isn't an increased risk of a collision attack succeeding beyond the first iteration - that is to say, you can't really influence the likelihood of a collision beyond the first hash.

Also consider that the entropy of the hash is the aforementioned 128 bits. Even considering that the best published collision attack has a complexity of only about 2^20.96 (again from Wikipedia), it would take a great number of iterations to cause two inputs to collide. The first-glance reasoning that I think you're falling victim to is:

  • Any two arbitrary inputs have a chance of collision x%.
  • The outputs of the first hash are two such inputs themselves.
  • Therefore, every iteration increases the chance of collision by x%.

This can be disproven by counterexample fairly easily. Consider again MD5:

  • The chance of collision for two inputs is 1 in 2^21 (taking the worst-case scenario from Wikipedia's cryptanalysis of MD5).
  • Hashing again compounds an equally likely chance of collision, therefore the chance of collision at the second round is 1 in 2^20.
  • Therefore, any two inputs hashed a number of times equal to the entropy of the digest are guaranteed to collide.

MD5 any two inputs 128 times in a row and you will see that this is not true. You probably won't find a single repeated hash between them - after all, you've only created 256 out of a possible 2^128 hash values, leaving virtually all of that space untouched. The probability of a collision in each round is independent of all other rounds.
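If you want to check this empirically, here is a minimal sketch (plain Python with hashlib; the inputs "hello" and "world" are just arbitrary examples):

    # Iterate MD5 128 times on two arbitrary inputs and look for any repeated
    # or colliding digests across the two chains.
    import hashlib

    def iterate_md5(data: bytes, rounds: int) -> list:
        """Return the list of digests produced by repeatedly hashing `data`."""
        digests = []
        current = data
        for _ in range(rounds):
            current = hashlib.md5(current).digest()
            digests.append(current)
        return digests

    chain_a = iterate_md5(b"hello", 128)
    chain_b = iterate_md5(b"world", 128)

    all_digests = chain_a + chain_b
    print("digests produced:", len(all_digests))                        # 256
    print("distinct digests:", len(set(all_digests)))                   # almost certainly 256
    print("cross-chain matches:", len(set(chain_a) & set(chain_b)))     # almost certainly 0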

There are two ways to understand why this is so. The first is that each iteration is essentially trying to hit a moving target. I think you could construct a proof based on the birthday paradox that after a surprisingly low number of hashing iterations you would likely see some hash digest from one input match some hash digest of the other input. But those matches would almost certainly occur at different steps of the iteration. And once that occurs, they can never have the same output on the same iteration, because the hash algorithm itself is deterministic.
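For scale, the birthday bound for 128-bit digests works out to roughly 2^64 digests before a coincidental match becomes likely - far fewer than 2^128, though still astronomically many. A quick back-of-the-envelope check of that figure (my own estimate, not part of the original answer):

    # Rough birthday-bound estimate: the expected number of uniformly random
    # 128-bit digests before the first repeat is about sqrt(pi/2 * 2**128),
    # i.e. roughly 1.25 * 2**64.
    import math

    bits = 128
    expected_until_repeat = math.sqrt(math.pi / 2 * 2 ** bits)
    print(f"~2^{math.log2(expected_until_repeat):.1f} digests before a repeat is likely")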

The other approach is to realize that the hash function actually adds entropy while it runs. Consider that an empty string has a 128-bit digest just like any other input; that cannot occur without entropy being added during the algorithm's steps. This is actually a necessary part of a cryptographic hash function: data must be destroyed, or else the input could be recovered from the digest. For inputs longer than the digest, yes, entropy is lost on the whole; it has to be in order to fit into the digest length. But some entropy is also added.
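For instance (hashlib again; the first two digests in the comments are the standard MD5 test values):

    # Even a zero-length input maps to a full 128-bit MD5 digest, and a one-byte
    # change to a short input produces a completely different-looking digest.
    import hashlib

    print(hashlib.md5(b"").hexdigest())     # d41d8cd98f00b204e9800998ecf8427e
    print(hashlib.md5(b"abc").hexdigest())  # 900150983cd24fb0d6963f7d28e17f72
    print(hashlib.md5(b"abd").hexdigest())  # a completely different digest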

I don't have equally exact numbers for other hash algorithms, but I think all the points I've made generalize to other hash functions and one-way / mapping functions.



Answer 2:

It does reduce entropy.

In the paper Random Mapping Statistics by Flajolet and Odlyzko, Theorem 2 shows that:

"If an n-bit random function is iterated k times, the expected number of image points is (1 - t_k) * 2^n (for large n), where t_k satisfies the recurrence relation t_0 = 0 and t_{k+1} = e^{-1 + t_k}. From this, it can be shown that the expected number of image points is 2^{n-i+1} when a random function is iterated k = 2^i times."

Further references are as follows:

  • Gligoroski, D. and Klima, V., 2010, September. Practical consequences of the aberration of narrow-pipe hash designs from ideal random functions. In International Conference on ICT Innovations (pp. 81-93). Springer Berlin Heidelberg.

  • Bhaumik, R., Dutta, A., Guo, J., Jean, J., Mouha, N. and Nikolić, I., 2015. More Rounds, Less Security?

  • Dinur, I. and Leurent, G., 2014, August. Improved generic attacks against hash-based MACs and HAIFA. In International Cryptology Conference (pp. 149-168). Springer Berlin Heidelberg.

From the last reference paper, one will find two lemmas on entropy loss (see that paper for their statements). Thus, the observation on entropy loss also holds if k independent random functions are used, instead of one random function that is iterated k times.
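A small variation of the same simulation idea (again my own sketch, not taken from the papers) illustrates that last observation: composing k distinct truncated-hash functions loses image points at essentially the same rate as iterating a single one.

    # Compose k *different* n-bit "random functions" (each keyed by a seed) and
    # watch the number of distinct image points shrink round by round.
    import hashlib

    n = 16
    size = 2 ** n

    def make_f(seed: int):
        """Build an n-bit 'random function' keyed by `seed` from a truncated hash."""
        def f(x: int) -> int:
            h = hashlib.sha256(seed.to_bytes(4, "big") + x.to_bytes(4, "big")).digest()
            return int.from_bytes(h[:4], "big") % size
        return f

    image = set(range(size))
    for k in range(1, 9):
        f_k = make_f(k)                   # a fresh, independent-looking function each round
        image = {f_k(x) for x in image}
        print(f"k={k}: {len(image):5d} distinct image points remain")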