How would you get tree-structured data from a database with the best performance? For example, say you have a folder-hierarchy in a database. Where the folder-database-row has ID, Name and ParentID columns.
Would you use a special algorithm to get all the data at once, minimizing the amount of database-calls and process it in code?
Or would you use do many calls to the database and sort of get the structure done from the database directly?
Maybe there are different answers based on x amount of database-rows, hierarchy-depth or whatever?
Edit: I use Microsoft SQL Server, but answers out of other perspectives are interesting too.
This article is interesting as it shows some retrieval methods as well as a way to store the lineage as a derived column. The lineage provides a shortcut method to retrieve the hierarchy without too many joins.
Not going to work for all situations, but for example given a comment structure:
You could also store
TopCommentID
which represents the top most comment:Where the
TopCommentID
andParentCommentID
arenull
or0
when it's the topmost comment. For child comments,ParentCommentID
points to the comment above it, andTopCommentID
points to the topmost parent.Out of all the ways to store a tree in a RDMS the most common are adjacency lists and nested sets. Nested sets are optimized for reads and can retrieve an entire tree in a single query. Adjacency lists are optimized for writes and can added to with in a simple query.
With adjacency lists each node a has column that refers to the parent node or the child node (other links are possible). Using that you can build the hierarchy based on parent child relationships. Unfortunately unless you restrict your tree's depth you cannot pull the whole thing in one query and reading it is usually slower than updating it.
With the nested set model the inverse is true, reading is fast and easy but updates get complex because you must maintain the numbering system. The nested set model encodes both parentage and sort order by enumerating all of the nodes using a preorder based numbering system.
I've used the nested set model and while it is complex for read optimizing a large hierarchy it is worth it. Once you do a few exercises in drawing out the tree and numbering the nodes you should get the hang of it.
My research on this method started at this article: Managing Hierarchical Data in MySQL.
In the product I work on we have some tree structures stored in SQL Server and use the technique mentioned above to store a node's hierarchy in the record. i.e.
Maintaining the the hierarchy is the tricky bit of course and makes use of triggers. But generating it on an insert/delete/move is never recursive, because the parent or child's hierarchy has all the information you need.
you can get all of node's descendants thusly:
Here's the insert trigger:
and here's the update trigger:
one more bit, a check constraint to prevent a circular reference in tree nodes:
I would also recommend triggers to prevent more than one root node (null parent) per tree, and to keep related nodes from belonging to different TreeIDs (but those are a little more trivial than the above.)
You'll want to check for your particular case to see if this solution performs acceptably. Hope this helps!
If you have many trees in the database, and you will only ever get the whole tree out, I would store a tree ID (or root node ID) and a parent node ID for each node in the database, get all the nodes for a particular tree ID, and process in memory.
However if you will be getting subtrees out, you can only get a subtree of a particular parent node ID, so you either need to store all parent nodes of each node to use the above method, or perform multiple SQL queries as you descend into the tree (hope there are no cycles in your tree!), although you can reuse the same Prepared Statement (assuming that nodes are of the same type and are all stored in a single table) to prevent re-compiling the SQL, so it might not be slower, indeed with database optimisations applied to the query it could be preferable. Might want to run some tests to find out.
If you are only storing one tree, your question becomes one of querying subtrees only, and the second answer applied.
Google for "Materialized Path" or "Genetic Trees"...