Longest Common Prefixes

2019-05-25 19:59发布

问题:

Suppose I constructed a suffix array, i.e. an array of integers giving the starting positions of all suffixes of a string in lexicographical order.

Example: For a string str=abcabbca,

the suffix array is:

suffixArray[] = [7 3 0 4 5 1 6 2]

Explanation:

i   Suffix      LCP of str and str[i..]   Length of LCP
7   a           a                           1
3   abbca       ab                          2
0   abcabbca    abcabbca                    8
4   bbca        empty string                0
5   bca         empty string                0
1   bcabbca     empty string                0
6   ca          empty string                0
2   cabbca      empty string                0

Now with this suffixArray constructed, I want to find the length of the Longest Common Prefix (LCP) between str (the string itself) and each of the other suffixes. What is the most efficient way to do it?

回答1:

Based on your comment, I'll assume we have access to the suffix array SA as well as the standard LCP array, i.e. a data structure that tells us, at index i>0, what the length of the longest common prefix of suffix SA[i] and its lexicographic predecessor SA[i-1] is.

I'll use the letter L to refer to the special LCP array we want to construct, as described in the question. I'll use the letter N to refer to the length of the input string str.

Then what we can do is this:

  1. Determine the position of str within the suffix array. We can do this by screening SA linearly to find the entry 0. (Explanation: str is the suffix of str starting at position 0. Therefore, 0 must appear as an entry of the suffix array.)

  2. Suppose the entry we find is at index k. Then we can set L[k]:=N, we because SA[k] is the string itself and has a prefix of N characters in common with itself.

  3. Then we can set L[k-1]:=LCP[k] and L[k+1]:=LCP[k+1] because that is how the standard LCP is defined.

  4. Then we go backwards from i:=k-2 down to 0 and set

    L[i] := min(LCP[i+1],L[i+1])
    

    This works because, at each iteration i, LCP[i+1] tells us the longest common prefix of the adjacent suffixes SA[i] and SA[i+1], and L[i+1] tells us the longest common prefix of the previously processed suffix SA[i+1] and the input string str. L[i] must be the minimum of those two, because L[i] indicates how long a prefix SA[i] has in common with str, and that cannot be longer than the prefix it has in common with SA[i+1], otherwise its position in the suffix array would be closer to k.

  5. We also count forward from i:=k+2 to N and set

    L[i] := min(LCP[i],L[i-1])
    

    based on the same reasoning.

Then all N values of L have been set, and it took no more than O(N) time, assuming that random access to the arrays and integer comparison are O(1), respectively.

Since the array we compute is N entries in length, a complexity of O(N) is optimal.

(Note. You can start the loops in steps 4 and 5 at k-1 and k+1, respectively, and get rid of step 3. The extra step only serves to make the explanation -- hopefully -- a little easier to follow.)