EDIT: I selected ridgerunner's answer as it contained the information needed to solve the problem. But I also felt like adding a fully fleshed-out solution to the specific question in case someone else wants to fully understand the example too. You will find it somewhere below.
This question is about clarifying the behavior of php's regex engine for recursive expressions. (If you ideas for how to properly match the strings below without using recursive php regex, that's very cool, but that's not the question.)
a(?:(?R)|a?)a
This is a simple expression that aims to match the character "a" or nothing, nested in one or multiple nests of the character "a". For instance, aa, aaa, aaaa, aaaaa. You don't need to use recursion for this:
aa*a
would work great. But the point is to use recursion.
Here is a piece of code you can run to test my failing pattern:
<?php
$tries=array('a','aa','aaa','aaaa','aaaaa','aaaaaa');
$regex='#a(?:(?R)|a?)a#';
foreach ($tries as $try) {
echo $try." : ";
if (preg_match($regex,$try,$hit)) echo $hit[0]."<br />";
else echo 'no match<br />';
}
?>
In the pattern, two "a"s are framing an alternation. In the alternation, we either match a recursion of the whole pattern (two "a"s framing an alternation), or the character "a", optionally empty.
In my mind, for "aaaa", this should match "aaaa".
But here is the output:
a : no match
aa : aa
aaa : aaa
aaaa : aaa
aaaaa : aaaaa
aaaaaa : aaa
Can someone explain what is happening on the third and fifth lines of output? I have tried tracing the path that I imagine the engine must be taking, but I must be imagining it wrong. Why is the engine returning "aaa" as a match for "aaaa"? What makes it so eager? I must be imagining the matching tree in the wrong order.
I realise that
#(?:a|a(?R)a)*#
kind of works, but my question is why the other pattern doesn't.
Thanks heaps!
Okay, I finally have it.
I awarded the correct answer to ridgerunner as he put me on the path to the solution, but I also wanted to write a full answer to the specific question in case someone else wants to fully understand the example too.
First the solution, then some notes.
A. Solution
Here is a summary of the steps followed by the engine. The steps should be read from top to bottom. They are not numbered. The recursion depth is shown in the left column, going up from zero to for and back down to zero. For convenience, the expression is shown at the top right. For ease of readability, the "a"s being matched are shown at their place in the string (which is shown at the very top).
B. Notes
1. The source of confusion
Here is what was counter-intuitive about it for me.
We are trying to match a a a a
I assumed that depth 0 of the recursion would match as a - - a and that depth 1 would match as - a a -
But in fact depth 1 first matches as - a a a
So depth 0 has nowhere to go to finish the match:
...then what? We are out of characters but the expression is not over.
So depth 1 is discarded. Note that depth 1 is not attempted again by giving back characters, which would lead us to a different depth 1 match of - a a -
That's because recursive matches are atomic. Once a depth matches, it's all or nothing, you keep it all or you discard it all.
Once depth 1 is discarded, depth 0 moves on to the other side of the alternation, and returns the match: a a a
2. The source of clarity
What helped me the most was the example that ridgerunner gave. In his example, he showed how to trace the path of the engine, which is exactly what I wanted to understand.
Following this method, I traced the full path of the engine for our specific example. As I have it, the path is 25 steps long, so it is considerably longer than the summary above. But the summary is accurate to the path I traced.
Big Thanks to everyone else who contributed, in particular Wiseguy for a very intriguing presentation. I still wonder if somehow I might be missing something and Wiseguy's answer might amount to the same!
Excellent (and difficult) question!
First, with the PCRE regex engine, the
(?R)
behaves like an atomic group (unlike Perl?). Once it matches (or doesn't match), the matching that happened inside the recursive call is final (and all backtracking breadcrumbs saved within the recursive call are discarded). However, the regex engine does save what was matched by the whole(?R)
expression, and can give it back and try the other alternative to achieve an overall match. To describe what is happening, lets change your example slightly so that it will be easier to talk about and keep track of what is being matched at each step. Instead of:aaaa
as the subject text, lets use:abcd
. And lets change the regex from'#a(?:(?R)|a?)a#'
to:'#.(?:(?R)|.?).#'
. The regex engine matching behavior is the same.Matching regex:
/.(?:(?R)|.?)./
to:"abcd"
There is nothing wrong with the regex engine. The correct match is
abc
(oraaa
for the original question.) A similar (albeit much longer) sequence of steps can be made for the other longer result string in question.After a lot of experimentation I think the PHP regex engine is broken. The exact same code under Perl works fine and matches all of your strings from beginning to end as I would expect.
Recursive regexes are hard on the imagination, but it looks to me as if
/a(?:(?R)|a?)a/
should matchaaaa
as ana
..a
pair containing a seconda
..a
pair, after which a second recursion fails and the alternate /a?/ matches instead as a null string.IMPORTANT: This describes recursive regex in PHP (which uses the PCRE library). Recursive regex works a bit differently in Perl itself.
Note: This is explained in the order you can conceptualize it. The regex engine does it backward of this; it dives down to the base case and works its way back.
Since your outer
a
s are explicitly there, it will match ana
between twoa
s, or a previous recursion's match of the entire pattern between twoa
s. As a result, it will only match odd numbers ofa
s (middle one plus multiples of two).At length of three,
aaa
is the current recursion's matching pattern, so on the fourth recursion it's looking for ana
between twoa
s (i.e.,aaa
) or the previous recursion's matched pattern between twoa
s (i.e.,a
+aaa
+a
). Obviously it can't match fivea
s when the string isn't that long, so the longest match it can make is three.Similar deal with a length of six, as it can only match the "default"
aaa
or the previous recursion's match surrounded bya
s (i.e.,a
+aaaaa
+a
).However, it does not match all odd lengths.
Since you're matching recursively, you can only match the literal
aaa
ora
+(prev recurs match)+a
. Each successive match will therefore always be twoa
s longer than the previous match, or it will punt and fall back toaaa
.At a length of seven (matching against
aaaaaaa
), the previous recursion's match was the fallbackaaa
. So this time, even though there are sevena
s, it will only match three (aaa
) or five (a
+aaa
+a
).When looping to longer lengths (80 in this example), look at the pattern (showing only the match, not the input):
What's going on here? Well, I'll tell you! :-)
When a recursive match would be one character longer than the input string, it punts back to
aaa
, as we've seen. In every iteration after that, the pattern starts over of matching two more characters than the previous match. Every iteration, the length of the input increases by one, but the length of the match increases by two. When the match size finally catches back up and surpasses the length of the input string, it punts back toaaa
. And so on.Alternatively viewed, here we can see how many characters longer the input is compared to the match length in each iteration:
For reasons that should now make sense, this happens at multiples of 2.
Stepping through by hand
I've slightly simplified the original pattern for this example. Remember this. We will come back to it.
What the author Jeffrey Friedl means by "the (?R) construct makes a recursive reference to the entire regular expression" is that the regex engine will substitute the entire pattern in place of
(?R)
as many times as possible.When tracing this by hand, you could work from the inside out. In
(?R)|a
,a
is your base case. So we'll start with that.If that matches the input string, take that match (
aaa
) back to the original expression and put it in place of(?R)
.If the input string is matched with our recursive value, subtitute that match (
aaaaa
) back into the original expression to recurse again.Repeat until you can't match your input using the result of the previous recursion.
Example
Input:
aaaaaa
Regex:
a((?R)|a)a
Start at base case,
aaa
.Does the input match with this value? Yes:
aaa
Recurse by putting
aaa
in the original expression:Does the input match with our recursive value? Yes:
aaaaa
Recurse by putting
aaaaa
in the original expression:Does the input match with our recursive value? No:
aaaaaaa
Then we stop here. The above expression could be rewritten (for simplicity) as:
Since it doesn't match
aaaaaaa
, it must matchaaa
. We're done,aaa
is the final result.