Background: I am taking over an application (the original engineer is leaving) that acts as a caching layer in front of some relatively slow backend services. Because the services use RESTful-style URLs, each URL is unique. The application uses MongoDB as the cache store and uses a hash of the URL as the cache key. A hash code should be close to unique, but it is not guaranteed to be unique.
Question: I was told the reason for using a hash code (instead of the URL itself) was that MongoDB's _id field has a length limit, but I can't find any documentation on that. All I can find in the MongoDB documentation is "_id field can be anything other than array as long as it's unique". Is it true that MongoDB's _id field has a length limit? If so, what is the limit?
The application is written in Java. Oh, and I am new to MongoDB.
There is a limit on the length of an indexed field value, which is 1024 bytes. That is a limit on index entry size rather than on document field size, which is limited to ~16MB (the maximum size of a complete document). Since _id is always indexed, the index entry limit applies to it.
For performance reasons you do not really want large values in indexed fields, as comparisons against such big values are considerably slower. Also remember that every index maintains a copy of the values being indexed, so large keys require significant amounts of memory. That in turn means more frequent disk access to swap virtual memory pages in and out of memory, which again has a negative impact on performance.
So yes, _id is effectively limited to 1024 bytes.
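To get a feel for whether your URLs would ever hit that limit, here is a rough Java sketch that compares a URL's UTF-8 byte length against 1024 bytes. The class name, method name, and constant are made up for illustration. The real index entry also carries some overhead beyond the raw string bytes, so treat this as an approximation rather than an exact check.

```java
import java.nio.charset.StandardCharsets;

final class IndexKeyCheck {
    // Assumption: the 1024-byte index entry limit quoted in the answer.
    static final int MAX_INDEX_ENTRY_BYTES = 1024;

    // Returns true if the URL's UTF-8 encoding is shorter than the limit.
    // This ignores per-entry overhead, so borderline values may still fail.
    static boolean probablyFitsInIndex(String url) {
        int byteLength = url.getBytes(StandardCharsets.UTF_8).length;
        return byteLength < MAX_INDEX_ENTRY_BYTES;
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only as sample input.
        String url = "http://example.com/api/users/42?expand=orders";
        System.out.println(url + " fits: " + probablyFitsInIndex(url));
    }
}
```

If most of your URLs are well under the limit, using the URL itself as _id may be workable. Long query strings are the usual reason to fall back to a hash.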
Hash collisions should be rare if you are using a good hash function with a long enough hash value. For example, if your hash outputs a 128-bit value, you will typically get a collision after producing 2^64 hashes -- so if you were producing a million hashes a second, you'd get a collision after about 600,000 years. This is probably good enough for most purposes.
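As a minimal sketch of that idea in Java, the class below derives a fixed-length cache key from a URL and redoes the collision arithmetic. It uses SHA-256 (256 bits, so the collision margin is even larger than in the 128-bit example above). This is a swapped-in choice, not necessarily what your application uses, and the class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class UrlCacheKey {
    // Returns the SHA-256 digest of the URL as a 64-character lowercase hex string.
    static String cacheKey(String url) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to an unsigned value so each byte prints as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Years needed to produce about 2^(hashBits/2) hashes at the given rate,
    // which is roughly where a first collision becomes likely.
    static double yearsToLikelyCollision(int hashBits, double hashesPerSecond) {
        double hashes = Math.pow(2.0, hashBits / 2.0);
        double secondsPerYear = 365.25 * 24 * 3600;
        return hashes / hashesPerSecond / secondsPerYear;
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only as sample input.
        System.out.println(cacheKey("http://example.com/api/users/42"));
        // 128-bit hash at a million hashes per second: roughly 585,000 years.
        System.out.printf("%.0f years%n", yearsToLikelyCollision(128, 1_000_000));
    }
}
```

The key is always 64 characters regardless of URL length, so it stays far below the index entry limit. If you want to be completely safe against collisions, you could also store the full URL in a separate, non-indexed field and compare it on read.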