I am looking for an efficient way to extract the shortest repeating substring.
For example:
input1 = 'dabcdbcdbcdd'
ouput1 = 'bcd'
input2 = 'cbabababac'
output2 = 'ba'
I would appreciate any answer or information related to the problem.
Also, in this post, people suggest that we can use the regular expression like
re=^(.*?)\1+$
to find the smallest repeating pattern in the string. But such expression does not work in Python and always return me a non-match (I am new to Python and perhaps I miss something?).
--- follow up ---
Here the criterion is to look for shortest non-overlap pattern whose length is greater than one and has the longest overall length.
A quick fix for this pattern could be
(.+?)\1+
Your regex failed because it anchored the repeating string to the start and end of the line, only allowing strings like abcabcabc
but not xabcabcabcx
. Also, the minimum length of the repeated string should be 1, not 0 (or any string would match), therefore .+?
instead of .*?
.
In Python:
>>> import re
>>> r = re.compile(r"(.+?)\1+")
>>> r.findall("cbabababac")
['ba']
>>> r.findall("dabcdbcdbcdd")
['bcd']
But be aware that this regex will only find non-overlapping repeating matches, so in the last example, the solution d
will not be found although that is the shortest repeating string. Or see this example: here it can't find abcd
because the abc
part of the first abcd
has been used up in the first match):
>>> r.findall("abcabcdabcd")
['abc']
Also, it may return several matches, so you'd need to find the shortest one in a second step:
>>> r.findall("abcdabcdabcabc")
['abcd', 'abc']
Better solution:
To allow the engine to also find overlapping matches, use
(.+?)(?=\1)
This will find some strings twice or more, if they are repeated enough times, but it will certainly find all possible repeating substrings:
>>> r = re.compile(r"(.+?)(?=\1)")
>>> r.findall("dabcdbcdbcdd")
['bcd', 'bcd', 'd']
Therefore, you should sort the results by length and return the shortest one:
>>> min(r.findall("dabcdbcdbcdd") or [""], key=len)
'd'
The or [""]
(thanks to J. F. Sebastian!) ensures that no ValueError
is triggered if there's no match at all.
^
matches at the start of a string. In your example the repeating substrings don't start at the beginning. Similar for $
. Without ^
and $
the pattern .*?
always matches empty string. Demo:
import re
def srp(s):
return re.search(r'(.+?)\1+', s).group(1)
print srp('dabcdbcdbcdd') # -> bcd
print srp('cbabababac') # -> ba
Though It doesn't find the shortest substring.