I have a PHP web application which uses a MySQL database for object tagging, in which I've used the tag structure accepted as the answer to this SO question.
I'd like to implement a tag hierarchy, where each tag can have a unique parent tag. Searches for a parent tag T would then match all descendants of T (i.e. T, tags whos parent is T (children of T), grandchildren of T, etc.).
The easiest way of doing this seems to be to add a ParentID field to the tag table, which contains the ID of a tag's parent tag, or some magic number if the tag has no parent. Searching for descendants, however, then requires repeated full searches of the database to find the tags in each 'generation', which I'd like to avoid.
A (presumably) faster, but less normalised way of doing this would be to have a table containing all the children of each tag, or even all the descendants of each tag. This however runs the risk of inconsistent data in the database (e.g. a tag being the child of more than one parent).
Is there a good way to make queries to find descendants fast, while keeping the data as normalised as possible?
I implemented it using two columns. I simplify it here a little, because I had to keep the tag name in a separate field/table because I had to localize it for different languages:
Look at these rows for example:
tag path
--- ----
database database/
mysql database/mysql/
mysql4 database/mysql/mysql4/
mysql4-1 database/mysql/mysql4-1/
oracle database/oracle/
sqlserver database/sqlserver/
sqlserver2005 database/sqlserver/sqlserver2005/
sqlserver2005 database/sqlserver/sqlserver2008/
etc.
Using the like
operator on the path field you can easily get all needed tag rows:
SELECT * FROM tags WHERE path LIKE 'database/%'
There are some implementation details like when you move a node in the hierarchy you have to change all children too etc., but it's not hard.
Also make sure that the length of your path is long enough - in my case I used not the tag name for the path, but another field to make sure that I don't get too long paths.
Ali's answer has a link to Joe Celko's Trees and Hierarchies in SQL for Smarties, which confirms my suspicion - there isn't a simple database structure that offers the best of all worlds. The best for my purpose seems to be the "Frequent Insertion Tree" detailed in this book, which is like the "Nested Set Model" of Ali's link, but with non-consecutive indexing. This allows O(1) insertion (a la unstructured BASIC line numbering), with occasional index reorganisation as and when needed.
You could build what Kimball calls a Hierarchy Helper Table.
Say you hierarchy looks like this: A -> B | B -> C | C -> D
you'd insert records into a table that looks like this
ParentID, ChildID, Depth, Highest Flag, Lowest Flag
A, A, 0, Y, N
A, B, 1, N, N
A, C, 2, N, N
A, D, 3, N, Y
B, B, 0, N, N
B, C, 1, N, N
B, D, 2, N, Y
C, C, 0, N, N
C, D, 1, N, Y
D, D, 0. N, Y
I think I have that correct.... anyways. The point is you still store you hierarchy correctly, you just build this table FROM your proper table. THIS table queries like a Banshee. Say you want to know what all the first level below B are.
WHERE parentID = 'B' and Depth = 1
I would use some kind of array to store the children tags, this should be a lot faster than joining a table on itself (especially if you have a large number of tags). I had a look, and I can't tell if mysql has a native array data type, but you can emulate this by using a text column and storing a serialized array in it. If you want to speed things up further, you should be able to put a text search index on that column to find out which tags are related.
[Edit]
After reading Ali's article, I did some more hunting and found this presentation on a bunch of approaches for implementing hierarchies in postgres. Might still be helpful for explanatory purposes.