How would you get tree-structured data from a database with the best performance? For example, say you have a folder-hierarchy in a database. Where the folder-database-row has ID, Name and ParentID columns.
Would you use a special algorithm to get all the data at once, minimizing the amount of database-calls and process it in code?
Or would you use do many calls to the database and sort of get the structure done from the database directly?
Maybe there are different answers based on x amount of database-rows, hierarchy-depth or whatever?
Edit: I use Microsoft SQL Server, but answers out of other perspectives are interesting too.
Celko wrote about this (2000):
http://www.dbmsmag.com/9603d06.html
http://www.intelligententerprise.com/001020/celko1_1.jhtml;jsessionid=3DFR02341QLDEQSNDLRSKHSCJUNN2JVN?_requestid=32818
and other people asked:
Joining other tables in oracle tree queries
How to calculate the sum of values in a tree using SQL
How to store directory / hierarchy / tree structure in the database?
Performance of recursive stored procedures in MYSQL to get hierarchical data
What is the most efficient/elegant way to parse a flat table into a tree?
finally, you could look at the rails "acts_as_tree" (read-heavy) and "acts_as_nested_set" (write-heavy) plugins. I don't ahve a good link comparing them.
It really depends on how you are going to access the tree.
One clever technique is to give every node a string id, where the parent's id is a predictable substring of the child. For example, the parent could be '01', and the children would be '0100', '0101', '0102', etc. This way you can select an entire subtree from the database at once with:
Because the criterion is an initial substring, an index on the ID column would speed the query.
I am a fan of the simple method of storing an ID associated with its parentID:
It is easy to maintain, and very scalable.
In Oracle there is SELECT ... CONNECT BY statement to retrieve trees.
There are several common kinds of queries against a hierarchy. Most other kinds of queries are variations on these.
From a parent, find all children.
a. To a specific depth. For example, given my immediate parent, all children to a depth of 1 will be my siblings.
b. To the bottom of the tree.
From a child, find all parents.
a. To a specific depth. For example, my immediate parent is parents to a depth of 1.
b. To an unlimited depth.
The (a) cases (a specific depth) are easier in SQL. The special case (depth=1) is trivial in SQL. The non-zero depth is harder. A finite, but non-zero depth, can be done via a finite number of joins. The (b) cases, with indefinite depth (to the top, to the bottom), are really hard.
If you tree is HUGE (millions of nodes) then you're in a world of hurt no matter what you try to do.
If your tree is under a million nodes, just fetch it all into memory and work on it there. Life is much simpler in an OO world. Simply fetch the rows and build the tree as the rows are returned.
If you have a Huge tree, you have two choices.
Recursive cursors to handle the unlimited fetching. This means the maintenance of the structure is O(1) -- just update a few nodes and you're done. However fetching is O(n*log(n)) because you have to open a cursor for each node with children.
Clever "heap numbering" algorithms can encode the parentage of each node. Once each node is properly numbered, a trivial SQL SELECT can be used for all four types of queries. Changes to the tree structure, however, require renumbering the nodes, making the cost of a change fairly high compared to the cost of retrieval.
look into the nested sets hierarchy model. it's pretty cool and useful.