Creating custom Object ID in MongoDB

2019-01-09 01:23发布

问题:

I am creating a service for which I will use MongoDB as a storage backend. The service will produce a hash of the user input and then see if that same hash (+ input) already exists in our dataset.

The hash will be unique yet random ( = non-incremental/sequential), so my question is:

  1. Is it -legitimate- to use a random value for an Object ID? Example:

$object_id = new MongoId(HEX-OF-96BIT-HASH);

Or will MongoDB treat the ObjectID differently from other server-produced ones, since a "real" ObjectID also contains timestamps, machine_id, etc?

What are the pros and cons of using a 'random' value? I guess it would be statistically slower for the engine to update the index on inserts when the new _id's are not in any way incremental - am I correct on that?

回答1:

Yes it is perfectly fine to use a random value for an object id, if some value is present in _id field of a document being stored, it is treated as objectId.

Since _id field is always indexed, and primary key, you need to make sure that different objectid is generated for each object. There are some guidelines to optimize user defined object ids :

http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs#OptimizingObjectIDs-Usethecollections%27naturalprimarykey%27intheidfield.



回答2:

While any values, including hashes, can be used for the _id field, I would recommend against using random values for two reasons:

  1. You may need to develop a collision-management strategy in the case you produce identical random values for two different objects. In the question, you imply that you'll generate IDs using a some type of a hash algorithm. I would not consider these values "random" as they are based on the content you are digesting with the hash. The probability of a collision then is a function of the diversity of content and the hash algorithm. If you are using something like MD5 or SHA-1, I wouldn't worry about the algorithm, just the content you are hashing. If you need to develop a collision-management strategy then you definitely should not use random or hash-based IDs as collision management in a clustered environment is complicated and requires additional queries.

  2. Random values as well as hash values are purposefully meant to be dispersed on the number line. That (a) will require more of the B-tree index to be kept in memory at all times and (b) may cause variable insert performance due to B-tree rebalancing. MongoDB is optimized to handle ObjectIDs, which come in ascending order (with one second time granularity). You're likely better off sticking with them.



回答3:

I just found out an answer to one of my questions, regarding indexing performance:

If the _id's are in a somewhat well defined order, on inserts the entire b-tree for the _id index need not be loaded. BSON ObjectIds have this property.

Source: http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs



回答4:

Whether it is good or bad depends upon it's uniqueness. Of course the ObjectId provided by MongoDB is quite unique so this is a good thing. So long as you can replicate that uniqueness then you should be fine.

There are no inherent risks/performance loses by using your own ID. I guess using it in string form might use up more index/storage/querying power but there you are using it in MongoID (ObjectId) form which should preserve the strengths of not storing it in a simple string.



标签: mongodb