Sorting tree with a materialized path?

2019-01-13 11:58发布

问题:

I have a tree structure in a table and it uses materialized paths to allow me to find children quickly. However, I also need to sort the results depth-first, as one would expect with threaded forum replies.

 id | parent_id | matpath |          created           
----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  7 |         1 | 1       | 2010-05-08 18:18:11.849735
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695

So the final results should actually be sorted like this:

 id | parent_id | matpath |          created
----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  7 |         1 | 1       | 2010-05-08 18:18:11.849735

How would I work that out? Can I do that in straight SQL (this is PostgreSQL 8.4) or should additional information be added to this table?

Update: trying to explain sort criteria better.

Imagine that id '1' is the root post to a forum and everything with a 'matpath' beginning with '1' is a child of that post. So ids 2 through 5 are direct replies to 1 and get matpaths of '1'. However, id 6 is a reply 2, not directly to 1, so it gets a matpath of 1.2. This means that for a threaded forum with proper nesting, with all ids shown in the tables, the structure of the forum would look like this, hence the ordering requirement:

* id 1 (root post)
    * id 2
        * id 6
            * id 8
    * id 3
    * id 4
    * id 5
        * id 9
    * id 7

回答1:

I typically create an additional columnn for this, called something like SortPath. It would contain the data that you need to sort by, concatenated together. That column would be of type varchar, and get sorted as a string. Something like this:

id | parent_id | matpath |          created            |                   sortpath
---+-----------+---------+-----------------------------+--------------------------------------------------------------------------------------
 2 |         1 | 1       | 2010-05-08 15:18:37.987544  | 2010-05-08 15:18:37.987544-2
 6 |         2 | 1.2     | 2010-05-08 17:50:43.288759  | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6
 8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695  | 2010-05-08 15:18:37.987544-2.2010-05-08 17:50:43.288759-6.2010-05-09 14:01:17.632695-8
 3 |         1 | 1       | 2010-05-08 17:38:14.125377  | 2010-05-08 17:38:14.125377-3
 4 |         1 | 1       | 2010-05-08 17:38:57.26743   | 2010-05-08 17:38:57.267430-4 
 5 |         1 | 1       | 2010-05-08 17:43:28.211708  | 2010-05-08 17:43:28.211708-5
 9 |         5 | 1.5     | 2010-05-09 14:02:43.818646  | 2010-05-08 17:43:28.211708-5.2010-05-09 14:02:43.818646-9
 7 |         1 | 1       | 2010-05-08 18:18:11.849735  | 2010-05-08 18:18:11.849735-7

A couple of things to note here:

  • sortpath will be sorted as a string, so it is important all dates have the same length for it to correctly sort. E.g., observe how 2010-05-08 17:38:57.26743 has an extra zero added in the sortpath column.
  • I have appended the PK of each node to the end of its date. This is so that if you happen to have two rows with the exact same date, they will always get returned in the same order due to the additional data we are appending.
  • To me, the data looks asymmetrical the way I have written it, because we are showing the current node's date in sortpath, but it is not in matpath. I would prefer to see it in both.
  • You may want to put the date of node ID 1 at the beginning of each sortcolumn as well. This is so that if you ever want to query for more than one forum at a time (you probably won't), then it will still sort correctly.


回答2:

I believe your materialized path is not right.

What logic do you get to sort things like this

1
1.2
1
1.5

Why is the second 1 not together with the first one?

If you had

1
1.2
2
2.5

This would be trivial.

EDIT: I have looked at your example and you are not storing materialized path of a row, but you are storing a materialized path of the parent row. Here's how the materialized path of the row actually should look like. Sorting directly on matpath would work if you would not have more than 9 branches if you stored it as:

 id | parent_id | matpath   |          created
----+-----------+-----------+----------------------------
  2 |         1 | 1.2       | 2010-05-08 15:18:37.987544
  6 |         2 | 1.2.6     | 2010-05-08 17:50:43.288759
  8 |         6 | 1.2.6.8   | 2010-05-09 14:01:17.632695
  3 |         1 | 1.3       | 2010-05-08 17:38:14.125377
  4 |         1 | 1.4       | 2010-05-08 17:38:57.26743
  5 |         1 | 1.5       | 2010-05-08 17:43:28.211708
  9 |         5 | 1.5.9     | 2010-05-09 14:02:43.818646
  7 |         1 | 1.7       | 2010-05-08 18:18:11.849735

otherwise (>9) you would have to turn the matpath into something like

001.002.006
001.002.006.008

that would support up to 999 branches.

Please note

  • even the approach with 4 fixed digits, such as 0001.0002.0006 would give you a field that is shorter then in the accepted answer
  • you could parse matpath an produce sorting value on the fly with a user function
  • you could directly store matpath in this format (it has some other nice properties, too)


回答3:

I'm not sure I understand why the accepted solution makes any sense at all. It works, but it is even less normalized and less efficient (more disk space, more indexes, etc) than @Unreason's solution (to just pad the ID's in the materialized path).

The whole scenario that the OP faces seems to stem from the fact that, as @Unreason correctly points out, the implementation of the materialized path (MP) is incorrect. The OP has provided the MP to the parent, not to the current node. In the accepted solution the SortPath column corrects this by providing the materialized path to the current node (this time using dates -- why? -- instead of the primary key).

For reference consider the following excerpt:

Materialized Path

In this approach each record stores the whole path to the root. In our previous example, lets assume that KING is a root node. Then, the record with ename = 'SCOTT' is connected to the root via the path SCOTT->JONES->KING. Modern databases allow representing a list of nodes as a single value, but since materialized path has been invented long before then, the convention stuck to plain character string of nodes concatenated with some separator; most often '.' or '/'.



回答4:

While @Unreason's answer about padding works, I would like to offer another solution which I believe is my own invention to this problem.

You are looking for function creating a bytestream, an f(x)=b_1b_2..b_i (sorry no MatJax on SO) where b_i is a byte. We know two bytestream compares the same as the first differing byte. We want such a function that f(x)<f(y) iff x<y.

Padding to the same length with 0 definitely obtains this goal, easily. You take two numbers, look at the first nonzero byte and there you are.

Steven Wittens (acko.net) introduced a different trick to Drupal core some eight years ago: put the number of digits in front of the string as another digit. So, the number 97685 becomes the characters 5 9 7 6 8 5. This also works: look at the length byte first, if they are not the same then the larger will definitely be the larger. Beyond that, you know the two numbers are equal length. He also used base 36 numbers with 0-9a-z being the digits, much like hex just for every letter. This encoding needs two bytes for the first 36 nodes, three for the next 1260...

Note that neither padding nor this cunning variable length encoding need separators for the materialized path although for readability they are often included.

numconv supports a base85 encoding but that requires a case sensitive collation. There are 94 ASCII characters if you remove lower case letters you still have base68.

But if you use a 'binary' field then you can do base256: instead of a cunning encoding just write the number as a series of bytes and then prefix the whole thing with the length of the bytestream as a single byte. This will allow you to encode any tree smaller than 2^2048 or so. For the first 256 nodes, you are using two bytes, for the next 65280 you are looking at three bytes. This is already quite efficient.

I nominate the utf8encode(x) function. Consider that! You need to descend into bitsorting instead of bytesorting but that doesn't change the outcome: find the leftmost zero bit. If the other string has a 1 there, it'll be a longer UTF-8 encoding so definitely that's larger. If they have the first zero at the same place then we have same length bit strings which compare well for us.

That's nice but what about separators. The UTF-8 algorithm when looking at it as purely an algorithm creating bitstreams can handle 31 bit numbers -- so it'll work for trees containing less than two billion nodes. Your materialized path will be a bitstream of UTF-8 encoded numbers which compare well: Discard the leftmost identical UTF-8 encoded numbers and we are back at the previous paragraph. Q.E.D.

Because we don't need separators or prefix bytes, we can encode the first 128 nodes into a single byte then the next 1920 into two bytes, and up to 65535, three bytes. For four bytes, base256 will win. For really big trees, you can treat UTF-8 as an encoder of 0-2147483647 into a byte stream. So you can use it as base2147483647 encoding :D

To summarize: UTF-8 is best for small trees and not much worse than base256 below two billion nodes. Beyond that until 2^2048 or so prefixed-with-length-base256 wins. Beyond that prefixed-with-length-base2147483647 wins and there's nothing beyond that.



回答5:

I can't think of a simple way to do this in straight SQL. Rather than matpath, I will use node_path here. node_path is matpath||'.'||id

 id | parent_id | node_path |          created           
----+-----------+---------+----------------------------
  2 |         1 | 1.2       | 2010-05-08 15:18:37.987544
  3 |         1 | 1.3       | 2010-05-08 17:38:14.125377
  4 |         1 | 1.4       | 2010-05-08 17:38:57.26743
  5 |         1 | 1.5       | 2010-05-08 17:43:28.211708
  7 |         1 | 1.7       | 2010-05-08 18:18:11.849735
  6 |         2 | 1.2.6     | 2010-05-08 17:50:43.288759
  9 |         5 | 1.5.9     | 2010-05-09 14:02:43.818646
  8 |         6 | 1.2.6.8   | 2010-05-09 14:01:17.632695

Now you want to order the tree based on node_path, with the sorting field defined by the number of times you have run the sort.

A custom recursive function in plpgsql sorting on split_part(node_path, '.', recursion_depth) will work. You will have to check for NULL values from split_part (and ignore those).