Finding out whether there exist two identical subs

2020-08-17 06:51发布

问题:

We've got a String.

ABAEABABEABE

Now we've got to check whether there exist a substring that is followed next by another substring that's exactly the same as the first one.

In this example: ABAEABABEABE
ABE is followed by ABE and that are two identical substrings.

In this example:

AAB

It would be simply A, beacuse A is followed by another A.

In this example:
ABCDEFGHIJKLMNO
There doesn't exist such a substring, so the answer would be NO.


I only managed to find an algorithm that would run in O(n^2). That is getting Hashes and its prefixes. Then for each letter we expand simply and check all the words ending on that letter. There are n letters. We need to expand it n times. So it's O(n^2). I believe there should be a O(n log n) algorithm for this problem.

Does anybody have a better idea?

回答1:

I guess you want the longest substring possible that follows this pattern.

The first thing to do is to build a Suffix tree of the input string. Using Ukkonen's algorithm, this is O(n).

Now, how does the condition you provided translate in the suffix tree? First things first, you are looking for a repeated substring [1]. Repeated substrings will appear as internal nodes of the suffix tree. The maximum number of nodes in a suffix tree built from a n-char string is 2n - 1.

You can build a Max-Heap containing such repeated substrings, using their length (number of chars). You do NOT keep substrings of length superior to N/2 (see [1]). This is O(N) where N is the number of internal nodes of the suffix tree. For any suffix tree:

0 ≤ Nn - 2

Now, you take the max out of the priority queue and process the internal node i you obtained:

  1. Let Si be the substring related to i, k = 0 and curnode = i
  2. While k < length(Si)
    1. If the key from i to a child of i is equal to Si[k], then k = k+1
    2. Else break the loop.
  3. If k == length(Si), then the substring is a match. Else, you proceed to the next substring.

Complexity summary

Let n be the length of the query string.

  • Building the suffix tree : O(n)
  • Building the Max-heap of repeated substrings: [3]
    • Identifying the repeated substrings (ie. internal nodes) and storing them in an array: O(n)
    • Heapify the array: O(n)
  • Finding the best match: O(n².log(n)) [2]

Hence the overall worst case complexity is the sum of the above and is O(n².log(n)).

Notes

I made the algorithm above... Hence it is suboptimal, if you are brave enough, you can go through this paper that describes a linear time algorithm! In any case, Suffix trees are a key to this problem so I suggest you study them thoroughly.

[1]: Warning, repeated substrings may partially overlap!

[2]: Actually, the worst case complexity is better than this very naive upper bound but I don't know how to prove it (yet?!). For example, if there were n - 2 internal nodes, that would mean that the original string consists of n occurrences of the same character. In that case, the first substring we check is a match => it's O(n.log(n)).

[3]: If we replace the heap construction by a regular sort (O(n.log(n))), the final comparison step runs in O(n²) instead of O(n².log(n)) ... Taking down the overall complexity between O(n.log(n)) (due to the sorting step) and O(n²).



回答2:

This problem could be solved with 'divide and conquer' Main-Lorentz algorithm:
Michael Main, Richard J. Lorentz. An O (n log n) Algorithm for Finding All Repetitions in a String [1982]

Edit: algorithm description and C++ implementation in Russian (might be translated with Chrome browser)

There exists also linear-time algorithm (don't know about practical implementations)