Suppose I constructed a suffix array, i.e. an array of integers giving the starting positions of all suffixes of a string in lexicographical order.
Example: For a string str=abcabbca
,
the suffix array is:
suffixArray[] = [7 3 0 4 5 1 6 2]
Explanation:
i Suffix LCP of str and str[i..] Length of LCP
7 a a 1
3 abbca ab 2
0 abcabbca abcabbca 8
4 bbca empty string 0
5 bca empty string 0
1 bcabbca empty string 0
6 ca empty string 0
2 cabbca empty string 0
Now with this suffixArray
constructed, I want to find the length of the Longest Common Prefix (LCP) between str
(the string itself) and each of the other suffixes. What is the most efficient way to do it?
Based on your comment, I'll assume we have access to the suffix array
SA
as well as the standardLCP
array, i.e. a data structure that tells us, at index i>0, what the length of the longest common prefix of suffixSA[i]
and its lexicographic predecessorSA[i-1]
is.I'll use the letter L to refer to the special LCP array we want to construct, as described in the question. I'll use the letter N to refer to the length of the input string
str
.Then what we can do is this:
Determine the position of
str
within the suffix array. We can do this by screeningSA
linearly to find the entry0
. (Explanation:str
is the suffix ofstr
starting at position 0. Therefore,0
must appear as an entry of the suffix array.)Suppose the entry we find is at index k. Then we can set
L[k]:=N
, we becauseSA[k]
is the string itself and has a prefix of N characters in common with itself.Then we can set
L[k-1]:=LCP[k]
andL[k+1]:=LCP[k+1]
because that is how the standard LCP is defined.Then we go backwards from i:=k-2 down to 0 and set
This works because, at each iteration i,
LCP[i+1]
tells us the longest common prefix of the adjacent suffixesSA[i]
andSA[i+1]
, andL[i+1]
tells us the longest common prefix of the previously processed suffixSA[i+1]
and the input stringstr
.L[i]
must be the minimum of those two, becauseL[i]
indicates how long a prefixSA[i]
has in common withstr
, and that cannot be longer than the prefix it has in common withSA[i+1]
, otherwise its position in the suffix array would be closer to k.We also count forward from i:=k+2 to N and set
based on the same reasoning.
Then all N values of
L
have been set, and it took no more than O(N) time, assuming that random access to the arrays and integer comparison are O(1), respectively.Since the array we compute is N entries in length, a complexity of O(N) is optimal.
(Note. You can start the loops in steps 4 and 5 at k-1 and k+1, respectively, and get rid of step 3. The extra step only serves to make the explanation -- hopefully -- a little easier to follow.)