可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Thing is I have a file that has room for metadata. I want to store a hash for integrity verification in it. Problem is, once I store the hash, the file and the hash along with it changes.

I perfectly understand that this is by definition impossible with one way cryptographic hash methods like md5/sha.

I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.

I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.

This is not what I want.

I want to know whether there is an algorithm where its possible to get the resulting hash from data where the very result of the hash itself is included.

It doesn't need to be cryptographic or fullfill a lot of criterias. It can also be based on some heuristics that after a realistic amount of time deliver the desired result.

I am really not so into mathematics, but couldn't there be some really advanced exponential modulo polynom cyclic back-reference devision stuff that makes this possible?

And if not, whats (if there is) the proof against it?

The reason why i need tis is because i want (ultimately) to store a hash along with MP4 files. Its complicated, but other solutions are not easy to implement as the file walks through a badly desigend production pipeline...

回答1:

It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.

It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.

(Further details possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2^{-8*byte_offset} in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)

Note: as per @KeithS's comments this solution is not to prevent against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it -- the embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself -- to detect inadvertent mismatch between an embedded system and the file used to program it. (In later systems I've just used UUIDs.)

回答2:

Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.

For example, let

hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521.

Let

Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)

Then

hash(Z) == hash(X) == last 32-bits of Z

The idea here is just that any 32-bit integer congruent to 0 modulo 65521 will have no effect on the hash of X. Then, since 65521 < 2^16, hash has a range less then 2^16, and there are at least 2^16 values less than 2^32 congruent to 0 modulo 65521. And so we can encode the hash into a 32 bit integer that will not affect the hash. You could actually use any number less than 2^16, 65521 just happens to be the largest such prime number.

回答3:

No, not possible. You either you a separate file for hashs ala md5sum, or the embedded hash is only for the "data" portion of the file.

回答4:

I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Altough in theory you could create such file for any kind of hash function (given enough time or the right algorithm), the attacker would be able to use exactly the same approach. Even more, he would have a chose: to use exactly your approach to obtain such file, or just to get rid of the check.

It means that now you have two problems instead of one, and both should be implemented with the same complexity. It's up to you to decide if it worth it.

EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.

回答5:

the way the nix package manager does this is by when calculating the hash you pretend the contents of the hash in the file are some fixed value like 20 x's and not the hash of the file then you write the hash over those 20 x's and when you check the hash you read that and ignore again it pretending the hash was just the fixed value of 20 x's when hashing

they do this because the paths at which a package is installed depend on the hash of the whole package so as the hash is of fixed length they set it as some fixed value and then replace it with the real hash and when verifying they ignore the value they placed and pretend it's that fixed value

but if you don't use such a method is it impossible

回答6:

It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).

Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.