I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.
A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.
For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
My basic understanding, so far:
- I need to iterate through each prefix P of a given string T
- I need to iterate through each suffix S in prefix P and add that to tree
- To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.
The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.
I'm also having trouble understanding:
- exactly when and how the "active point" is assigned, used and changed
- what is going on with the canonization aspect of the algorithm
- Why the implementations I've seen need to "fix" bounding variables that they are using
Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:
Update 2017-11-04
After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, npm install chalk
from the same location, and then run with node.js to see some colourful output. There's a stripped down version in the same Gist, without any of the debugging code.
https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6
I tried to implement the Suffix Tree with the approach given in jogojapan's answer, but it didn't work for some cases due to wording used for the rules. Moreover, I've mentioned that nobody managed to implement an absolutely correct suffix tree using this approach. Below I will write an "overview" of jogojapan's answer with some modifications to the rules. I will also describe the case when we forget to create important suffix links.
Additional variables used
Let's use a concept of an internal node - all the nodes, except the root and the leafs are internal nodes.
Observation 1
When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the
active point
andremainder
).Observation 2
If at some point
active_length
is greater or equal to the length of current edge (edge_length
), we move ouractive point
down untiledge_length
is strictly greater thanactive_length
.Now, let's redefine the rules:
Rule 1
Rule 2
This definition of the
Rule 2
is different from jogojapan', as here we take into account not only the newly created internal nodes, but also the internal nodes, from which we make an insertion.Rule 3
In this definition of
Rule 3
we also consider the inserts of leaf nodes (not only split-nodes).And finally, Observation 3:
When the symbol we want to add to the tree is already on the edge, we, according to
Observation 1
, update onlyactive point
andremainder
, leaving the tree unchanged. BUT if there is an internal node marked as needing suffix link, we must connect that node with our currentactive node
through a suffix link.Let's look at the example of a suffix tree for cdddcdc if we add a suffix link in such case and if we don't:
If we DON'T connect the nodes through a suffix link:
If we DO connect the nodes through a suffix link:
Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very important for our approach with active point. The problem is that if we don't put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the
Rule 3
, because, according to it, if there's no a suffix link, then we must put theactive_node
to the root.When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labled 'c'). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the
active node
was set to the red node. But we don't make an insert from the red node, as the letter 'c' is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shorter suffix.Finally, here are my implementations of the Suffix Tree:
Hope that this "overview" combined with jogojapan's detailed answer will help somebody to implement his own Suffix Tree.