Ukkonen's suffix tree algorithm in plain Engli

2019-06-23 17:30发布

I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.

A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.

For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

My basic understanding, so far:

  • I need to iterate through each prefix P of a given string T
  • I need to iterate through each suffix S in prefix P and add that to tree
  • To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.

The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.

I'm also having trouble understanding:

  • exactly when and how the "active point" is assigned, used and changed
  • what is going on with the canonization aspect of the algorithm
  • Why the implementations I've seen need to "fix" bounding variables that they are using

Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:

https://gist.github.com/2373868


Update 2017-11-04

After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, npm install chalk from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6

7条回答
放我归山
2楼-- · 2019-06-23 18:10

I tried to implement the Suffix Tree with the approach given in jogojapan's answer, but it didn't work for some cases due to wording used for the rules. Moreover, I've mentioned that nobody managed to implement an absolutely correct suffix tree using this approach. Below I will write an "overview" of jogojapan's answer with some modifications to the rules. I will also describe the case when we forget to create important suffix links.

Additional variables used

  1. active point - a triple (active_node; active_edge; active_length), showing from where we must start inserting a new suffix.
  2. remainder - shows the number of suffixes we must add explicitly. For instance, if our word is 'abcaabca', and remainder = 3, it means we must process 3 last suffixes: bca, ca and a.

Let's use a concept of an internal node - all the nodes, except the root and the leafs are internal nodes.

Observation 1

When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder).

Observation 2

If at some point active_length is greater or equal to the length of current edge (edge_length), we move our active point down until edge_length is strictly greater than active_length.

Now, let's redefine the rules:

Rule 1

If after an insertion from the active node = root, the active length is greater than 0, then:

  1. active node is not changed
  2. active length is decremented
  3. active edge is shifted right (to the first character of the next suffix we must insert)

Rule 2

If we create a new internal node OR make an inserter from an internal node, and this is not the first SUCH internal node at current step, then we link the previous SUCH node with THIS one through a suffix link.

This definition of the Rule 2 is different from jogojapan', as here we take into account not only the newly created internal nodes, but also the internal nodes, from which we make an insertion.

Rule 3

After an insert from the active node which is not the root node, we must follow the suffix link and set the active node to the node it points to. If there is no a suffix link, set the active node to the root node. Either way, active edge and active length stay unchanged.

In this definition of Rule 3 we also consider the inserts of leaf nodes (not only split-nodes).

And finally, Observation 3:

When the symbol we want to add to the tree is already on the edge, we, according to Observation 1, update only active point and remainder, leaving the tree unchanged. BUT if there is an internal node marked as needing suffix link, we must connect that node with our current active node through a suffix link.

Let's look at the example of a suffix tree for cdddcdc if we add a suffix link in such case and if we don't:

  1. If we DON'T connect the nodes through a suffix link:

    • before adding the last letter c:

    • after adding the last letter c:

  2. If we DO connect the nodes through a suffix link:

    • before adding the last letter c:

    • after adding the last letter c:

Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very important for our approach with active point. The problem is that if we don't put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the Rule 3, because, according to it, if there's no a suffix link, then we must put the active_node to the root.

When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the active node was set to the red node. But we don't make an insert from the red node, as the letter 'c' is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shorter suffix.

Finally, here are my implementations of the Suffix Tree:

  1. Java
  2. C++

Hope that this "overview" combined with jogojapan's detailed answer will help somebody to implement his own Suffix Tree.

查看更多
登录 后发表回答