I'd like to find the longest repeating string within a string, implemented in JavaScript and using a regular-expression based approach.
I have an PHP implementation that, when directly ported to JavaScript, doesn't work.
The PHP implementation is taken from an answer to the question "Find longest repeating strings?":
preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $input, $matches, PREG_SET_ORDER);
This will populate $matches[0][X]
(where X
is the length of $matches[0]
) with the longest repeating substring to be found in $input
. I have tested this with many input strings and found am confident the output is correct.
The closest direct port in JavaScript is:
var matches = /(?=((.+)(?:.*?\2)+))/.exec(input);
This doesn't give correct results
input Excepted result matches[0][X]
======================================================
inputinput input input
7inputinput input input
inputinput7 input input
7inputinput7 input 7
XXinputinputYY input XX
I'm not familiar enough with regular expressions to understand what the regular expression used here is doing.
There are certainly algorithms I could implement to find the longest repeating substring. Before I attempt to do that, I'm hoping a different regular expression will produce the correct results in JavaScript.
Can the above regular expression be modified such that the expected output is returned in JavaScript? I accept that this may not be possible in a one-liner.
Javascript matches only return the first match -- you have to loop in order to find multiple results. A little testing shows this gets the expected results:
function maxRepeat(input) {
var reg = /(?=((.+)(?:.*?\2)+))/g;
var sub = ""; //somewhere to stick temp results
var maxstr = ""; // our maximum length repeated string
reg.lastIndex = 0; // because reg previously existed, we may need to reset this
sub = reg.exec(input); // find the first repeated string
while (!(sub == null)){
if ((!(sub == null)) && (sub[2].length > maxstr.length)){
maxstr = sub[2];
}
sub = reg.exec(input);
reg.lastIndex++; // start searching from the next position
}
return maxstr;
}
// I'm logging to console for convenience
console.log(maxRepeat("aabcd")); //aa
console.log(maxRepeat("inputinput")); //input
console.log(maxRepeat("7inputinput")); //input
console.log(maxRepeat("inputinput7")); //input
console.log(maxRepeat("7inputinput7")); //input
console.log(maxRepeat("xxabcdyy")); //x
console.log(maxRepeat("XXinputinputYY")); //input
Note that for "xxabcdyy" you only get "x" back, as it returns the first string of maximum length.
It seems JS regexes are a bit weird. I don't have a complete answer, but here's what I found.
Although I thought they did the same thing re.exec() and "string".match(re) behave differently. Exec seems to only return the first match it finds, whereas match seems to return all of them (using /g in both cases).
On the other hand, exec seems to work correctly with ?= in the regex whereas match returns all empty strings. Removing the ?= leaves us with
re = /((.+)(?:.*?\2)+)/g
Using that
"XXinputinputYY".match(re);
returns
["XX", "inputinput", "YY"]
whereas
re.exec("XXinputinputYY");
returns
["XX", "XX", "X"]
So at least with match you get inputinput as one of your values. Obviously, this neither pulls out the longest, nor removes the redundancy, but maybe it helps nonetheless.
One other thing, I tested in firebug's console which threw an error about not supporting $1, so maybe there's something in the $ vars worth looking at.