Say I have a blob of text, 5,000 characters long. I run it through a hashing program and it generates a 40-character hash. Now I run another blob of text, 10,000 characters. It still generates a 40-character hash. That's true for text of any length.
My question is: if the hashes are all unique, wouldn't I be able to compress anything into a 40-character string?
Hashing is about spreading values as evenly as possible, not about uniqueness!
Of course, perfect uniqueness would mean perfect spreading, but that's often simply not possible, no matter how good your hashing algorithm is.
A striking example:
C#, for example, provides an Int32 hash code for each object, so also for Int64:
Conclusion here: there are 2^64 different possible Int64 values, but only 2^32 hash codes for them!!
So: each hash value for an Int64 is shared by, on average, 2^64 / 2^32 = 2^32 (over four billion) other Int64's!
So much for uniqueness hey :-)
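To make that concrete, here's a small C# sketch. The colliding pair below relies on current .NET computing Int64.GetHashCode() as roughly (int)value ^ (int)(value >> 32), which is an implementation detail; what's guaranteed, by the pigeonhole principle, is that *some* pair must collide:

```csharp
using System;

class Int64HashDemo
{
    static void Main()
    {
        long a = 0L;
        long b = (1L << 32) | 1; // 4294967297: a different value...

        // ...but the same Int32 hash code on current .NET, where
        // Int64.GetHashCode() is roughly (int)value ^ (int)(value >> 32).
        // (This exact pair colliding is an implementation detail; the
        // pigeonhole principle only guarantees that some pair must.)
        Console.WriteLine(a.GetHashCode()); // 0
        Console.WriteLine(b.GetHashCode()); // 0
        Console.WriteLine(a == b);          // False
    }
}
```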
Hashing is not unique.
Hashing is a technique that attempts to generate a distinct hash for each value fed to it, but uniqueness is not guaranteed.
Good hashing algorithms produce duplicate hash values much less frequently than bad ones. Also, hashing is one-directional: you can't go from a hash back to the original, so it's not meant for compression.
And: a hash doesn't need to be unique. What matters is that the same input is always transformed into the same hash by the algorithm. You shouldn't use a hash as an identifier!
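You can see duplicates directly by shrinking the hash space on purpose. The sketch below (my own example, assuming .NET 5+ for SHA1.HashData) truncates SHA-1 to 16 bits, so by the birthday bound a duplicate typically shows up after only a few hundred inputs; the full 160-bit hash would need around 2^80 of them:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class CollisionDemo
{
    // Truncate SHA-1 to 16 bits so collisions become easy to find; a duplicate
    // is expected after roughly sqrt(2^16) = 256 inputs (the birthday bound).
    static ushort TinyHash(string s)
    {
        byte[] d = SHA1.HashData(Encoding.ASCII.GetBytes(s));
        return (ushort)((d[0] << 8) | d[1]);
    }

    static void Main()
    {
        var seen = new Dictionary<ushort, string>();
        for (int i = 0; ; i++)
        {
            string input = "input-" + i;
            ushort h = TinyHash(input);
            if (seen.TryGetValue(h, out string earlier))
            {
                Console.WriteLine($"\"{earlier}\" and \"{input}\" both hash to {h:x4}");
                return;
            }
            seen[h] = input;
        }
    }
}
```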
One way to think of a hash is like a human fingerprint (hashes are sometimes referred to as fingerprints).
You can "compress" any person into a (pretty much) unique fingerprint, but you cannot know who someone is from their fingerprint alone. It's just like a hash: you can easily work out hash("abcdef") -> a1b2c3, but given only a1b2c3, you cannot trivially tell what the source data was.
To reverse a fingerprint, you need to compare it against a database of known people -> fingerprint mappings (if the unknown fingerprint matches Person1, it belongs to them).
With a hash you must do much the same thing: you keep a database of all string -> hash mappings (called a rainbow table), then look up the row with the hash "a1b2c3", and it shows that "abcdef" was hashed to get it. The other, more common way is simply to try every combination of characters, hash each one, and compare (a brute-force attack).
Finally, while human fingerprints are "unique", it's possible for two to be the same; it's just incredibly unlikely. It's the same with hashing, and some hashing algorithms are more susceptible to collisions than others.
Theoretically, then, hashing looks like a great compression method, but decompressing is utterly impractical beyond (say) 10 ASCII characters of data. You're right that you can hash anything down to a 40-character string, but you cannot practically decompress it, and even in theory it falls apart: once the input is longer than the hash, multiple inputs share the same hash, so you could never know which original you started with.
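For a feel of why, here's roughly what that brute-force attack looks like in C# (SHA-1 gives exactly the 40 hex characters from the question; SHA1.HashData and Convert.ToHexString assume .NET 5+). Two lowercase letters means only 26^2 candidates; ten characters of printable ASCII means about 95^10 ≈ 6×10^19, which is why "decompression" stops being practical so quickly:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class BruteForceDemo
{
    // SHA-1 yields a 20-byte digest: the 40 hex characters from the question.
    static string Sha1Hex(string s) =>
        Convert.ToHexString(SHA1.HashData(Encoding.ASCII.GetBytes(s)));

    static void Main()
    {
        string target = Sha1Hex("hi"); // pretend we only know this 40-char hash

        // "Decompression" = trying candidates until one hashes to the target.
        const string alphabet = "abcdefghijklmnopqrstuvwxyz";
        foreach (char c1 in alphabet)
            foreach (char c2 in alphabet)
            {
                string candidate = $"{c1}{c2}";
                if (Sha1Hex(candidate) == target)
                {
                    Console.WriteLine($"Recovered \"{candidate}\" from its hash");
                    return;
                }
            }
    }
}
```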
If you are using a well-defined hash function correctly, you can practically assume that hash results are unique.
The issue with your question is that a hash is a one-way function: there is no inverse function to take a hash value and get back your original blob, unless you keep a huge table of all possible original values (a so-called rainbow table).
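A minimal sketch of that table idea (the common-passwords.txt wordlist name is made up for this example): precompute hash -> original once, and reversing any hash that happens to be in the table becomes a single dictionary lookup:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class LookupTableDemo
{
    static string Sha1Hex(string s) =>
        Convert.ToHexString(SHA1.HashData(Encoding.UTF8.GetBytes(s)));

    static void Main()
    {
        // Precompute hash -> original for a set of candidate inputs
        // (the wordlist file name is hypothetical).
        var table = new Dictionary<string, string>();
        foreach (string word in File.ReadLines("common-passwords.txt"))
            table[Sha1Hex(word)] = word;

        // "Reversing" a hash is now a single lookup -- but only if the
        // original happens to be in the table.
        string stolenHash = Sha1Hex("letmein"); // stand-in for a captured hash
        if (table.TryGetValue(stolenHash, out string original))
            Console.WriteLine($"Recovered: {original}");
        else
            Console.WriteLine("Not in the table");
    }
}
```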
RSA hashes are not unique. There's an extremely tiny chance (something on the order of 1 in 36^40) of a false positive when hashing two different pieces of cleartext. For most applications the chance is considered small enough to ignore, since on average it would take millions of years to see an accidental collision.
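For a sense of scale, the usual birthday approximation says the chance of any collision among n inputs in a space of size d is about 1 - exp(-n^2 / 2d). Plugging in a 160-bit (40 hex character) hash and a made-up workload of a trillion documents:

```csharp
using System;

class BirthdayDemo
{
    static void Main()
    {
        // P(any collision among n values in a space of size d) ~ 1 - exp(-n^2/2d),
        // and for tiny exponents that is ~ n^2 / 2d.
        double d = Math.Pow(2, 160); // 160-bit hash = 40 hex characters
        double n = 1e12;             // a trillion hashed inputs (made-up workload)
        Console.WriteLine(n * n / (2 * d)); // ~3.4e-25: effectively zero
    }
}
```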
Hashing is not guaranteed to be unique, but if you are looking for a unique hash, look at gperf. It can generate a perfect (collision-free) hash function for a predetermined set of inputs.
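gperf itself generates C code, but here's a toy C# version of the same idea: since the keys are fixed in advance, you can simply search for table parameters under which a simple deterministic hash puts every key in its own slot:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PerfectHashDemo
{
    // A deliberately simple, deterministic hash: sum of character codes.
    static int SimpleHash(string s) => s.Sum(c => (int)c);

    static void Main()
    {
        // The keys are known in advance, so we can search for a table size
        // under which SimpleHash maps every key to a distinct slot.
        string[] keys = { "if", "else", "while", "for", "return" };
        for (int mod = keys.Length; ; mod++)
        {
            var slots = new HashSet<int>();
            if (keys.All(k => slots.Add(SimpleHash(k) % mod)))
            {
                Console.WriteLine($"Collision-free with table size {mod}"); // 17 here
                return;
            }
        }
    }
}
```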