Can someone explain how this code for constructing the LCP from a suffix array works? suffixArr[]
is an array such that suffixArr[i]
holds the value of the index in the string for the suffix with rank i.
void LCPconstruct()
{
int i,C[1001],l;
C[suffixArr[0]] = n;
for(i=1;i<n;i++)
C[suffixArr[i]] = suffixArr[i-1];
l = 0;
for(i=0;i<n;i++)
{
if(C[i]==n)
LCPadj[i] = 0;
else
{
while(i+l<n && C[i]+l<n && s[i+l] == s[C[i]+l])
l++;
LCPadj[i] = l;
l = max(l-1,0);
}
}
for(i=0;i<n;i++)
cout<<LCPadj[suffixArr[i]]<<"\n";
}
First, it's important to realize that the algorithm processes the suffixes in the original order, i.e. the order in which they appear in the input string. Not in lexicographical order.
So if the input string is
abxabc
, it first considersabxabc
, thenbxabc
, thenxabc
and so forth.For each suffix it considers in this order, it determines the position of the suffix that is its lexicographical predecessor(*) (so here, and only here, it uses the concept of lexicographical order). For the first suffix
abxabc
, the lexicographical predecessor, i.e. the suffix that appears directly before it in the lexicographical ordering of suffixes, isabc
. It determines this by an O(1) lookup in the arrayC
, which has been prepared specifically for this purpose.The inner loop compares the characters of
abxabc
andabc
one by one and determines that these two suffixes have the first 2 characters in common. This is the variablel
in your code, and it means that the entry in LCP for the suffixabxabc
must be 2, so we setLCPadj[i] = l
. Note thati
here refers to the position of the suffix in the input string, not to its position in the suffix array. SoLCPadj
is not the LCP array (yet). It's an auxiliary data structure.Then it proceeds to the next string, which is
bxabc
. Again it usesC
to find thatbc
is the lexicographical predecessor of that and then determines how many prefix characters the two share. And here comes the trick: It can be sure that this must be at least as many as in the previous step (i.e. 2), minus 1. Why? Because the string we currently consider,bxabc
, is of course a suffix of the string previously considered (abxabc
), therefore the lexicographical predecessor of that string (abc
) must also have a suffix that is 1 character shorter (bc
), and that suffix must also be somewhere in the suffix array, and it must share its prefix with the currently considered string, minus the first character. Moreover, there can't be any other suffix that is both shorter and lexicographically closer to the string currently considered. The latter is quite logical if you think about how lexicographical sorting works, but there are also formal proofs of this (e.g. Lemma 5.10 in Kärkkäinen's lecture here)So that describes the main principle at work here. There are a few things to note about your code to fully understand the role of each variable:
C
is an auxiliary array (n
integers in length) that stores, for each suffix in the input string, the position of that other suffix that is its immediate lexicographical predecessor. This array is constructed not from left to right, but (wisely) by going through the suffix array from left to right, because that makes it easy to determine the immediate lexicogaphical predecessor of any string: The immediate lexicographical predecessor of the suffix starting at positionsuffixArr[i]
must of course be located at positionsuffixArr[i-1]
. Confirm in your code that this is howC
is defined.LCPadj
stores LCP values for the suffix in the order in which they appear in the input string, not the order in which they appear in the suffix array. That is why at output time,LCPadj
is not printed from left to right, but by going through the suffix array from left to right, and printingLCPadj[i]
in that order. Confirm that this is the case.I hope this helps. Let me know if not.
(*)By lexicographical predecessor I mean the immediate predecessor of the suffix in the lexicographically ordered list of suffixes, i.e. the suffix to its immediate left in the suffix array.