At the outset, let me explain that this question is neither about how to capture groups, nor about how to use quantifiers, two features of regex I am perfectly familiar with. It is more of an advanced question for regex lovers who may be familiar with unusual syntax in exotic engines.
Capturing Quantifiers
Does anyone know if a regex flavor allows you to capture quantifiers? By this, I mean that the number of characters matched by quantifiers such as + and * would be counted, and that this number could be used again in another quantifier.
For instance, suppose you wanted to make sure you have the same number of Ls and Rs in this kind of string: LLLRRRRR
You could imagine a syntax such as
L(+)R{\q1}
where the + quantifier for the L is captured, and where the captured number is referred to in the quantifier for the R as {\q1}
This would be useful to balance the number of {@,=,-,/} in strings such as @@@@ "Star Wars" ==== "1977" ---- "Science Fiction" //// "George Lucas"
Relation to Recursion
In some cases quantifier capture would elegantly replace recursion, for instance a piece of text framed by the same number of Ls and Rs, a in
L(+) some_content R{\q1}
The idea is presented in some details on the following page: Captured Quantifiers
It also discusses a natural extension to captured quantifers: quantifier arithmetic, for occasions when you want to match (3*x + 1) the number of characters matched earlier.
I am trying to find out if anything like this exists.
Thanks in advance for your insights!!!
Update
Casimir gave a fantastic answer that shows two methods to validate that various parts of a pattern have the same length. However, I wouldn't want to rely on either of those for everyday work. These are really tricks that demonstrate great showmanship. In my mind, these beautiful but complex methods confirm the premise of the question: a regex feature to capture the number of characters that quantifers (such as + or *) are able to match would make such balancing patterns very simple and extend the syntax in a pleasingly expressive way.
Update 2 (much later)
I found out that .NET has a feature that comes close to what I was asking about. Added an answer to demonstrate the feature.
Coming back five weeks later because I learned that .NET has something that comes very close to the idea of "quantifier capture" mentioned in the question. The feature is called "balancing groups".
Here is the solution I came up with. It looks long, but it is quite simple.
How does it work?
The first non-capturing group matches the
@
characters. In that non-capturing group, we have three named groups c1, c2 and c3 that don't match anything, or rather, that match an empty string. These groups will serve as three counters c1, c2 and c3. Because .NET keeps track of intermediate captures when a group is quantified, every time an@
is matched, a capture is added to the capture collections for Groups c1, c2 and c3.Next,
[^@=]+
eats up all the characters up to the first=
.The second quantified group
(?<-c1>=)+
matches the=
characters. That group seems to be named-c1
, but-c1
is not a group name.-c1
is.NET syntax to pop one capture from the c1 group's capture collection into the ether. In other words, it allows us to decrement c1. If you try to decrement c1 when the capture collection is empty, the match fails. This ensures that we can never have more=
than@
characters. (Later, we'll have to make sure that we cannot have more@
than=
characters.)The next steps repeat steps 2 and 3 for the
-
and/
characters, decrementing counters c2 and c3.The
[^/]+
eats up the rest of the string.The
(?(c1)(?!))
is a conditional that says "If group c1 has been set, then fail". You may know that(?!)
is a common trick to force a regex to fail. This conditional ensures that c1 has been decremented all the way to zero: in other words, there cannot be more@
than=
characters.Likewise, the
(?(c2)(?!))
and(?(c3)(?!))
ensure that there cannot be more@
than-
and/
characters.I don't know about you, but even this is a bit long, I find it really intuitive.
I don't know a regex engine that can capture a quantifier. However, it is possible with PCRE or Perl to use some tricks to check if you have the same number of characters. With your example:
you can check if
@
=
-
/
are balanced with this pattern that uses the famous Qtax trick, (are you ready?): the "possessive-optional self-referencing group"pattern details:
The main idea
The non-capturing group contains only one
@
. Each time this group is repeated a new character is added in capture groups 2, 3 and 4.the possessive-optional self-referencing group
How does it work?
At the first occurence of the @ character the capture group 2 is not yet defined, so you can not write something like that
(\2 =)
that will make the pattern fail. To avoid the problem, the way is to make the backreference optional:\2?
The second aspect of this group is that the number of character
=
matched is incremented at each repetition of the non capturing group, since an=
is added each time. To ensure that this number always increases (or the pattern fails), the possessive quantifier forces the backreference to be matched first before adding a new=
character.Note that this group can be seen like that: if group 2 exists then match it with the next
=
The recursive way
You need to use overlapped matches, since you will use the @ part several times, it is the reason why all the pattern is inside lookarounds.
pattern details:
The main difference with the precedent pattern is that this one doesn't care about the order of
=
-
and/
groups. (However you can easily make some changes to the first pattern to deal with that, with character classes and negative lookaheads.)Note: For the example string, to be more specific, you can replace the negative lookbehind with an anchor (
^
or\A
). And if you want to obtain the whole string as match result you must add.*
at the end (otherwise the match result will be empty as playful notices it.)