I have read that the Longest Common Prefix (LCP) could be used to find the number of occurrences of a pattern in a string.
Specifically, you just need to create the suffix array of the text, sort it, and then instead of doing binary search to find the range so that you can figure out the number of occurrences, you simply compute the LCP for each successive entry in the suffix array.
Although using binary search to find the number of occurrences of a pattern is obvious I can't figure out how the LCP helps find the number of occurrences here.
For example for this suffix array for banana
:
LCP Suffix entry
N/A a
1 ana
3 anana
0 banana
0 na
2 nana
How does the LCP help find the number of occurrences of a substring like "banana" or "na" is not obvious to me.
Any help figuring out how LCP helps here?
I do not know any way of using the LCP array instead of carrying out a binary search, but I believe what you refer to is the technique described by Udi Manber and Gene Myers in Suffix arrays: a new method for on-line string searches.
The idea is this: In order to find the number of occurrences of a given string P (length m) in a text T (length N),
The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons you need to make, you compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N).
The LCP-LR array helps improve this to O(m+log N), in the following way:
So, in the next step, you consider (M,...,R) and a new central point M' in the middle:
The trick now is that LCP-LR is precomputed such that a O(1)-lookup tells you the longest common prefix of M and M', lcp(M,M').
You know already (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:
We continue recursively.
The overall effect is that no character of P is compared to any character of the text more than once. The total number of character comparisons is bounded by m, so the total complexity is indeed O(m+log N).
Obviously, the key remaining question is how did we precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array? As you said, the standard LCP array tells you the lcp of consecutive entries only, i.e. lcp(x-1,x) for any x. But M and M' in the description above are not necessarily consecutive entries, so how is that done?
The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: It always starts with (0,...,N) and divides that at the center, and then continues either left or right and divide that half again and so forth. If you think of it: Every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. So that is 2*N distinct precomputed values, hence LCP-LR is O(N) in size.
Moreover, there is a straight-forward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array – I'd suggest posting a separate question if you need a detailed description of that.
To sum up:
The Longest Common Prefix (LCP) is the Lowest Common Ancestor (LCA) in a suffix tree. Once you have the Lowest Common Ancestor, you can count the number of nodes that branch out from the LCA. This will give you the number of occurrences of a pattern in the suffix tree. This is the relationship between the LCP and LCA.