Having just recently delved into the world of NoSQL with MongoDB, I am still struggling to understand the best approach to architecture without 3rd normalizing the data and then joining upon it. Currently the project I am designing is a simple collection of articles, akin to a wiki. An article will have a title and text, as well as (possibly) a parent article and one or more children articles.
I have had a number of differing ideas of for the database design and want to pick the one that best matches the strenghts of MongoDB.
Idea One
Since the most frequent type of query on the database will invariably be to simply retrieve an article, I am embedding all the relevant data that the page will need in order to display everything. The actual article of course, as well as a parent subdocument with a url (which will match the _id of some other document) as well as a title which is the text that we will print out on screen for inside of the tag. An identicle structure exists for the children except that it is an array so that all of the children are there.
{
"_id" : "test-article-2",
"title" : "Test Article 2",
"text" : "Blah 2",
"parent" : {
"title" : "Test Article",
"url" : "test-article"
},
"children" : [
{
"title" : "Test Article 3",
"url" : "test-article-3"
}
]
}
This type of design seems to have the advantage of speed (in my opinion) but I would like to hear what others thing of this design.
Idea Two
More along the lines that I am used to coming from a relational database world. would be to not embed sub-objects into the design but to simply put in their unique identifiers. Thus parent now contains just a text string which will match the _id of some other document, and children will similarly have an array of strings which link to _ids.
In order to get all of the information to view an article however, we would now need to make a number of queries (at least I think we need to...) One to get the main article, then another to get the parent's title for putting in the tag and then another to get all of the children articles and likewise get their titles.
This seems like a lot more querying just to display an article, but it may make the updating of the database easier, if for instance some article is deleted or updated. (again not sure on that point).
{
"_id" : "test-article-2",
"title" : "Test Article 2",
"text" : "Blah 2",
"parent" : "test-article",
"children" : [ "test-article-3", "test-article-4"]
}
Would be glad to hear the input of those with more experience with MongoDB designing.
You need to consider the type of queries you will need to perform and how frequently each type will be needed. When I was working on something similar, I came up with six possible actions:
- Do something with the parent
- Do something with the children
- Do something with the ancestors (parents of parents, parents of parents of parents, etc.)
- Do something with the descendants (children of children, children of children of children, etc.)
- Change relationships (add/move/delete nodes in the hierarchy)
- Change the main data in the current node (ex. changing the value in the "title" field)
You'll want to estimate how important each of these is to your application.
If most of your work involves working with stored data for some given article including its immediate parent and children, the first idea is most useful. Indeed in MongoDB, it is quite common to place all the information you need in the same document rather than referencing it externally so that you only need to retrieve one thing and just work with that data. The last four actions in the list are more tricky though.
In particular, you will need to traverse through the tree to retrieve ancestors and descendants in this case, moving through intermediary documents and following a path, even though you may only care about the last document in the path. This can be slow for long hierarchies. Changing relationships can require moving a lot of information around in multiple documents because of all the data present in each one. But even changing a single field like "title" can be annoying, because you have to consider the fact that this field is present in multiple different documents, either as a main field or under the parent or children fields.
Basically, your first idea works best in more static applications where you won't be changing the data a lot after initially creating it, but where you need to read it regularly.
The MongoDB documentation has five recommended approaches for handling tree-like (hierarchical) structures. All of them have different advantages and disadvantages, though they all make it easy to update the main data in an article by only needing to do so in one document.
- Parent References: each node contains a reference to its parent.
- Advantages:
- Fast parent lookup (lookup by "_id" = your doc title, return "parent" field)
- Fast children lookup (lookup by "parent" = your doc title, which will return all child documents)
- Updating relationships is just a matter of changing the "parent" field
- Changing the underlying data requires changes to only one document
- Disadvantages:
- Searching by ancestors and descendants is slow, requiring a traversal
- Child References: each node contains a reference array to its children
- Advantages:
- Fast retrieval of children (return the children array)
- Fast relationship update (just update children's array where needed)
- Disadvantages:
- Finding a parent requires looking up your _id in all children arrays of all documents until you find it (since the parent will contain the current node as a child)
- Ancestors & descendants search require traversals of the tree
- Array of Ancestors: each node contains a reference to an array of its ancestors & its parent
- Advantages:
- Fast retrieval of ancestors (no traversal required to find a specific one)
- Easy to lookup parent and children following the "Parent References" approach
- To find descendants, just look up the ancestors, because all descendants must contain the same ancestors
- Disadvantages:
- Need to worry about keeping the array of ancestors as well as the parent field updated whenever there is a change in relationships, often across multiple documents.
- Materialized Paths: each node contains a path to itself - requires regex
- Advantages:
- Easy to find children and descendants using regex
- Can use a path to retrieve parent and ancestors
- Flexibility, such as finding nodes by partial paths
- Disadvantages:
- Relationship changes are difficult as they may require changes to paths across multiple documents
- Nested Sets: Each node contains a "left" and "right" field to help find sub-trees
- Advantages:
- Easy to retrieve descendants in an optimal way by searching between "left" and "right"
- Like the "Parent Reference" approach, it's easy to find parent and children
- Disadvantages:
- Need to traverse structure to find ancestors
- Relationship changes perform the worst here than any other option because every single document in the tree may need to be changed to make sure "left" and "right" still make sense once something changes in the hierarchy
The five approaches are discussed in more detail in the MongoDB documentation.
Your second idea combines the "Parent References" and "Child References" approaches discussed above. This approach makes it easy to find both the children and the parent and makes it easy to update relationships and the main data of an article (though you need to update both the parent and the children fields), but you still need to traverse through it to find ancestors and descendants.
If you are interested in finding ancestors and descendants (and care about this more than being able to easily update relationships), you can consider adding an ancestors array to your second idea to make it also easy to query for ancestors and descendants. Of course, updating relationships becomes a real pain if you do this though.
Conclusion:
Ultimately it all depends on what actions are needed the most. Since you're working with articles, whose underlying data (like the title) can change frequently, you may want to avoid the first idea since you would need to update not only the main document for that article but all child documents as well as the parent.
Your second idea makes it easy to retrieve the immediate parent and children. Updating relationships is also not too difficult (It's certainly better than some of the other options available).
If you really want to make it easy to find ancestors and descendants at the expense of updating relationships as easily, choose to include an array of ancestor references.
In general, try to minimize the number of traversals required, as they require running some kind of iteration or recursion to get to the data you want. If you value the ability to update relationships, you should also pick an option that changes fewer nodes in the tree (Parent References, Child References, and your second idea can do this).
Very nice summary from @CynicalProgrammer, I would add one more: use the fact that json is a tree to your adventage! No need to store only the 'left' and 'right' node ID in each node, why not store a subtree, 3-4-5 deep down? So it would look like this:
{
left: {
left: {
left: {...},
right: {...}
},
right: {...}
},
right: {... you get the idea ...}
}
This way, you'd need hell of a lot less queries for traverse the tree, and the number of documents in the collection would be a fraction. A slight drawback is that you need somewhat more complex code to write to the tree, and mongodb documents will be bigger, meaning individual writes are slower.
I think this is probably the best way to store trees in mongodb. But again, remember, mongo is meant for document storage, not "big-ass schemaless tree" storage. You might want to look into something like neo4j for that.