Having just recently delved into the world of NoSQL with MongoDB, I am still struggling to understand the best approach to architecture without 3rd normalizing the data and then joining upon it. Currently the project I am designing is a simple collection of articles, akin to a wiki. An article will have a title and text, as well as (possibly) a parent article and one or more children articles.
I have had a number of differing ideas of for the database design and want to pick the one that best matches the strenghts of MongoDB.
Idea One
Since the most frequent type of query on the database will invariably be to simply retrieve an article, I am embedding all the relevant data that the page will need in order to display everything. The actual article of course, as well as a parent subdocument with a url (which will match the _id of some other document) as well as a title which is the text that we will print out on screen for inside of the tag. An identicle structure exists for the children except that it is an array so that all of the children are there.
{
"_id" : "test-article-2",
"title" : "Test Article 2",
"text" : "Blah 2",
"parent" : {
"title" : "Test Article",
"url" : "test-article"
},
"children" : [
{
"title" : "Test Article 3",
"url" : "test-article-3"
}
]
}
This type of design seems to have the advantage of speed (in my opinion) but I would like to hear what others thing of this design.
Idea Two
More along the lines that I am used to coming from a relational database world. would be to not embed sub-objects into the design but to simply put in their unique identifiers. Thus parent now contains just a text string which will match the _id of some other document, and children will similarly have an array of strings which link to _ids.
In order to get all of the information to view an article however, we would now need to make a number of queries (at least I think we need to...) One to get the main article, then another to get the parent's title for putting in the tag and then another to get all of the children articles and likewise get their titles.
This seems like a lot more querying just to display an article, but it may make the updating of the database easier, if for instance some article is deleted or updated. (again not sure on that point).
{
"_id" : "test-article-2",
"title" : "Test Article 2",
"text" : "Blah 2",
"parent" : "test-article",
"children" : [ "test-article-3", "test-article-4"]
}
Would be glad to hear the input of those with more experience with MongoDB designing.
You need to consider the type of queries you will need to perform and how frequently each type will be needed. When I was working on something similar, I came up with six possible actions:
You'll want to estimate how important each of these is to your application.
If most of your work involves working with stored data for some given article including its immediate parent and children, the first idea is most useful. Indeed in MongoDB, it is quite common to place all the information you need in the same document rather than referencing it externally so that you only need to retrieve one thing and just work with that data. The last four actions in the list are more tricky though.
In particular, you will need to traverse through the tree to retrieve ancestors and descendants in this case, moving through intermediary documents and following a path, even though you may only care about the last document in the path. This can be slow for long hierarchies. Changing relationships can require moving a lot of information around in multiple documents because of all the data present in each one. But even changing a single field like "title" can be annoying, because you have to consider the fact that this field is present in multiple different documents, either as a main field or under the parent or children fields.
Basically, your first idea works best in more static applications where you won't be changing the data a lot after initially creating it, but where you need to read it regularly.
The MongoDB documentation has five recommended approaches for handling tree-like (hierarchical) structures. All of them have different advantages and disadvantages, though they all make it easy to update the main data in an article by only needing to do so in one document.
The five approaches are discussed in more detail in the MongoDB documentation.
Your second idea combines the "Parent References" and "Child References" approaches discussed above. This approach makes it easy to find both the children and the parent and makes it easy to update relationships and the main data of an article (though you need to update both the parent and the children fields), but you still need to traverse through it to find ancestors and descendants.
If you are interested in finding ancestors and descendants (and care about this more than being able to easily update relationships), you can consider adding an ancestors array to your second idea to make it also easy to query for ancestors and descendants. Of course, updating relationships becomes a real pain if you do this though.
Conclusion:
Ultimately it all depends on what actions are needed the most. Since you're working with articles, whose underlying data (like the title) can change frequently, you may want to avoid the first idea since you would need to update not only the main document for that article but all child documents as well as the parent.
Your second idea makes it easy to retrieve the immediate parent and children. Updating relationships is also not too difficult (It's certainly better than some of the other options available).
If you really want to make it easy to find ancestors and descendants at the expense of updating relationships as easily, choose to include an array of ancestor references.
In general, try to minimize the number of traversals required, as they require running some kind of iteration or recursion to get to the data you want. If you value the ability to update relationships, you should also pick an option that changes fewer nodes in the tree (Parent References, Child References, and your second idea can do this).
Very nice summary from @CynicalProgrammer, I would add one more: use the fact that json is a tree to your adventage! No need to store only the 'left' and 'right' node ID in each node, why not store a subtree, 3-4-5 deep down? So it would look like this:
This way, you'd need hell of a lot less queries for traverse the tree, and the number of documents in the collection would be a fraction. A slight drawback is that you need somewhat more complex code to write to the tree, and mongodb documents will be bigger, meaning individual writes are slower.
I think this is probably the best way to store trees in mongodb. But again, remember, mongo is meant for document storage, not "big-ass schemaless tree" storage. You might want to look into something like neo4j for that.