I am trying to understand the internal allocation and placement of arrays and hashes (which, from my understanding, are implemented through arrays) in MongoDB documents.
In our domain we have documents with anywhere between thousands and hundreds of thousands of key-value pairs in logical groupings up to 5-6 levels deep (think nested hashes).
We represent the nesting in the keys with a dot, e.g., x.y.z, which upon insertion into MongoDB will automatically become something like:
{
"_id" : "whatever",
"x" : {
"y" : {
"z" : 5
}
}
}
The most common operation is incrementing a value, which we do with an atomic $inc, usually 1000+ values at a time with a single update command. New keys are added over time but not frequently, say, 100 times/day.
It occurred to me that an alternative representation would be to not use dots in names but some other delimiter and create a flat document, e.g.,
{
"_id" : "whatever",
"x-y-z" : 5
}
Given the number of key-value pairs and the usage pattern in terms of $inc updates and new key insertion, I am looking for guidance on the trade-offs between the two approaches in terms of:
The on-disk storage of documents in MongoDB is in BSON format. There is a detailed description of the BSON format here:
- http://bsonspec.org/#/specification
While there is some disk savings from using short key names (since, as you can see by looking at the spec, the key name is embedded in the document), it looks to me like there'd be almost no net difference between the two designs in terms of on-disk space used -- the extra bytes you use by using the delimiters (-) get bought back by not having to have string terminators for the separate key values.
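If you want to check the size comparison on your own key distribution, here is a rough sketch that computes the encoded size of a document by following the element layout in the BSON spec. It is not a real encoder: the helper name bson_size is made up for this example, it only handles nested dicts, strings and integers, and it assumes small integers are stored as int32 (values typed into the mongo shell default to 8-byte doubles, which adds 4 bytes per value in both schemas alike).

```python
def bson_size(doc):
    """Size in bytes of a dict encoded as a BSON document.

    Hypothetical helper for illustration: supports only dict, str and int values.
    """
    size = 4 + 1  # int32 length prefix + trailing 0x00 terminator
    for key, value in doc.items():
        size += 1 + len(key.encode("utf-8")) + 1  # type byte + NUL-terminated key
        if isinstance(value, dict):
            size += bson_size(value)  # embedded document
        elif isinstance(value, int):
            size += 4 if -2**31 <= value < 2**31 else 8  # int32 or int64
        elif isinstance(value, str):
            size += 4 + len(value.encode("utf-8")) + 1  # length + bytes + NUL
        else:
            raise TypeError(f"unsupported value type: {type(value).__name__}")
    return size


# The two sample documents from the question.
nested = {"_id": "whatever", "x": {"y": {"z": 5}}}
flat = {"_id": "whatever", "x-y-z": 5}
print(bson_size(nested), bson_size(flat))  # 46 34

# Many sibling keys under one prefix: nesting stores the prefix once,
# the flat form repeats "x-y-" in every key.
nested_many = {"x": {"y": {f"k{i}": 1 for i in range(100)}}}
flat_many = {f"x-y-k{i}": 1 for i in range(100)}
print(bson_size(nested_many), bson_size(flat_many))
```

Each level of nesting costs a handful of bytes (type byte, key, length prefix, terminator), while each flat key repeats its prefix, so which form comes out smaller depends on how many keys share a prefix and how long the prefixes are. Either way the difference is a few bytes per key.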
$inc updates should take almost identical times with both formats, since they're both going to be in-memory operations. Any improvements in in-memory update time are going to be the tiniest of rounding errors compared to the time taken to read the document off of disk.
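To make the comparison concrete, here is a small sketch of how the update document for a batch of increments could be built for either schema. The function name build_inc_update and the tuple-based key paths are assumptions made for this example, not anything from your code or from a driver.

```python
def build_inc_update(increments, delimiter="."):
    """Return a MongoDB update document that $inc's every counter in `increments`.

    `increments` maps a key path (a tuple of name parts) to the amount to add.
    Hypothetical helper for illustration only.
    """
    return {
        "$inc": {delimiter.join(path): amount for path, amount in increments.items()}
    }


increments = {("x", "y", "z"): 1, ("x", "y", "w"): 3}

# Nested schema: dot notation reaches into the subdocuments.
nested_update = build_inc_update(increments)
# Flat schema: the same paths become plain top-level field names.
flat_update = build_inc_update(increments, delimiter="-")

print(nested_update)  # {'$inc': {'x.y.z': 1, 'x.y.w': 3}}
print(flat_update)    # {'$inc': {'x-y-z': 1, 'x-y-w': 3}}
```

Either dictionary would then be sent as the update argument of a single update command, for example PyMongo's collection.update_one({"_id": "whatever"}, nested_update). The only difference between the schemas at this level is the delimiter in the field names.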
The performance of new key inserts should also be virtually identical. If adding the new key/value pair leaves the new document small enough to fit in the old location on disk, then all that happens is the in-memory version is updated and a journal entry gets written. Eventually, the in-memory version will be written to disk.
New key inserts are more problematic if the document grows beyond the space previously allocated for it. In that case, the server must move the document to a new location and update all indexes pointing to that document. This is generally a slower operation and should be avoided. However, the schema changes that you're discussing shouldn't affect the frequency of document movement. Again, I think this is a wash.
My suggestion would be to use the schema that most lends itself to developer productivity. If you're having performance problems, then you can ask separate questions about how you can either scale your system or improve performance, or both.