可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

say I have a blob of text 5000 characters. I run it through a hashing program and generates a 40 char long hash. now i run another blob of text, 10000 characters. it still generates a hash 40 chars long. that's true for text of any length.

my question is if the hashes are all unique, wouldn't i be able to compress anything into a 40 char string?

回答1:

One way to think of a hash is like a human fingerprint (hashes are also sometimes referred to as fingerprints)..

You can "compress" any person in to a (pretty much) unique finger-print.. but, you cannot know who someone is by their fingerprint alone.. This is just like a hash, you can easily work out hash("abcdef") -> a1b2c3, but given only a1b2c3, you cannot trivially tell the source data.

To reverse a finger print, you need to compare the fingerprint to a database of known people->finger-prints (if the unknown fingerprint matches Person1, the unknown fingerprint belongs to them)

With a hash, again you must do much the same thing - you have a database with all string->hash mappings (called a rainbow table). Then you lookup the row with the hash "a1b2c3" and it shows "abcdef" was hashed in order to get this. The other more common way is to simply try every combination of characters, hash them and compare (a brute force attack)

Finally, while human fingerprints are "unique", it's possible to have two the same, it's just incredibly unlikely - it's the same with hashing... Some hashing algorithms are more susceptible to collisions than others.

my question is if the hashes are all unique, wouldn't i be able to compress anything into a 40 char string?

Theoretically hashing is a great compression method, but to decompress is incredibly impractical beyond (say) 10 ASCII characters of data.. You're right, you can compress anything to a 40 character string, but you cannot decompress it practically (even theoretically is a bit of a stretch..)

回答2:

Hashing is not unique.

Hashing is a technique to attempt to generate a unique hash for each value fed to it, but it is not guaranteed unique.

Good hashing algorithms will have duplicate hash values much less frequently than bad hash algorithms. Also, hashing is one directional - meaning you can't go from a hash -> original, so it's not meant for compression.

And: A hash doesn't need to be unique. The same input needs to be tranformed into the same hash by the algorithm. You don't use a hash as identifier!

回答3:

Not all hashes are guaranteed to be unique. The wikipedia entry on the topic is pretty good: http://en.wikipedia.org/wiki/Hash_function

回答4:

RSA hashes are not unique. There's an extremely tiny (something to the order of 1:36^40) chance that you will create a false positive when hashing two different bits of clear text. For most applications, the chance is considered sufficiently small that you can ignore it, since it would take, on average, millions of years to see an accidental collision.

回答5:

Hashing is for spreading as good as possible, not for uniqueness!

Of course, reaching uniqueness is a reaching a 100% spreading, but that's often not possible, no matter how good your hashing algorithm.

Striking example :

C# for example provide an Int32 code for each object as HashCode... So also for Int64 :

       Int64 a = Int64.MaxValue;
       Int32 myHash =  a.GetHashCode();

Conclusion here : there are 2^64 different possible instances of Int64, but only 2^32 hashcodes for them!!

So : each hashvalue for an Int64 is shared by (average)

4 294 967 295

other Int64's!

So much for uniqueness hey :-)

回答6:

Consider looking at this from the point of view of the Pigeonhole Principle. If you're stuffing n items into a smaller number of buckets k, there will necessarily be some buckets with multiple items. So to answer your question, no hashes are not unique.

回答7:

Hashing is not guaranteed to be unique, but if you are looking for a unique hash, look at gperf. It can generate a unique hashing function for a set of predetermined inputs.

回答8:

You can compress the signature of any text into a hash, but you cannot reverse calculate what the text was to give you that hash. Simply speaking the only way to find out what the text was that gave you the hash would be to brute-force text through the hash to try and find a match.

See Wikipedia

回答9:

Don't get confused by the .Net GetHashCode(). It's not very good as it's only 32 bits compared to 640 bits in the original question (if each character is 8 bits).

回答10:

If you are using a well defined hash function correctly you can practically assume that hash results are unique.

Issue, with your question is a hash is a one way function. There is no inverse function to take a value and go back to your original blob. Unless you keep a huge table of all possible original values (so called rainbow table).

回答11:

They are not unique, but you're much more likely to drop dead of a heart attack before you find two different documents with the same hash for a high quality algorithm, e.g. SHA-1