I'm trying to match nested {} brackets with a regular expression in Perl so that I can extract certain pieces of text from a file. This is what I have currently:
my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;
foreach (@matches) {
    print "$_\n";
}
Sometimes this works as expected. For instance, if $str = "abc {{xyz} abc} {xyz}", I obtain:
abc
{{xyz} abc}
{xyz}
as expected. But for other input strings it does not. For example, if $str = "{abc} {{xyz}} abc", the output is:
{abc} {{xyz}}
abc
which is not what I expected. I would have wanted {abc} and {{xyz}} to be on separate lines, since each is balanced on its own in terms of brackets. Is there an issue with my regular expression? If so, how would I go about fixing it?
This just modifies and extends the classic solution a bit:
Demo (This is in PCRE. The behavior is slightly different from Perl when it comes to recursive regex, but I think it should produce the same result for this case).
After some struggle (I am not familiar with Perl!), this is the demo on ideone.
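The linked demo is not reproduced here. As a rough sketch of the idea (not the answer's exact code), the loop below assumes Perl 5.10 or later and uses a numbered-group recursion; the helper name `tokens_with_ampersand` and the pattern itself are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from the original answer: collects every
# top-level token (a bare word or a balanced {...} group).
sub tokens_with_ampersand {
    my ($str) = @_;
    my @tokens;
    # Group 1 matches one balanced {...}; (?1) recurses into that same group.
    while ($str =~ /(\{(?:[^{}]++|(?1))*\})|\w+/g) {
        push @tokens, $&;    # the text matched by the whole regex
    }
    return @tokens;
}

print "$_\n" for tokens_with_ampersand("{abc} {{xyz}} abc");
```

Each pass through the loop resumes where the previous match ended, so the two bracketed groups and the trailing word come out as separate tokens.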
$& refers to the string matched by the whole regex.
Note that this solution assumes that the input is valid. It will behave rather randomly on invalid input. It can be modified slightly to halt when invalid input is encountered. For that, I need more details on the input format (preferably as a grammar), such as whether abc{xyz}asd is considered valid input or not.
Wow. What a bunch of complicated answers to something that simple.
The problem you're having is that you're matching in greedy mode. That is, you are asking the regex engine to match as much as possible while making the expression true.
To avoid a greedy match, just add a '?' after your quantifier. That makes the match as short as possible.
So, I changed your expression from:
To:
...and now it works exactly as you're expecting.
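The changed expression is missing from this copy of the answer. The sketch below assumes the '?' goes after both '*' quantifiers in the bracket branch, which is a guess at the intended edit rather than the author's confirmed pattern. The helper name `lazy_matches` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the question's pattern with both quantifiers in the
# bracket branch made lazy (an assumed reading of the answer).
sub lazy_matches {
    my ($str) = @_;
    my @matches = $str =~ /\{(?:\{.*?\}|[^\{])*?\}|\w+/sg;
    return @matches;
}

print "$_\n" for lazy_matches("{abc} {{xyz}} abc");
```

With this change both inputs from the question split as wanted. It is worth noting that the lazy version still only copes with one level of nesting inside the outer pair: for "{a{b{c}}}" it stops one bracket early and returns "{a{b{c}}".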
HTH
Francisco
You were surprised by how your pattern matched, but no one explained it? Here's how your pattern is matching:
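One way to see this for yourself, as a sketch: take the bracket branch of the question's pattern and wrap each part in a capture group, so Perl reports which part swallowed what. The helper name `trace_bracket_branch` and the hash keys are made up for illustration; the extra parentheses do not change what is matched.

```perl
use strict;
use warnings;

# Hypothetical instrumented copy of the bracket branch of the question's
# pattern: the added parentheses only record which part matched what.
sub trace_bracket_branch {
    my ($str) = @_;
    if ($str =~ /(\{)((?:(\{.*\})|[^\{])*)(\})/s) {
        return (opening => $1, body => $2, inner => $3, closing => $4);
    }
    return;
}

my %trace = trace_bracket_branch("{abc} {{xyz}} abc");
print "$_: $trace{$_}\n" for qw(opening body inner closing);
```

For this input the body comes out as "abc} {{xyz}" and the inner \{.*\} part as "{{xyz}", neither of which is balanced. The stray "}" after "abc" is accepted because [^\{] matches any character other than "{", including "}".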
As you can see, the problem is that /\{.*\}/ matches too much. What should be in there is something that matches
where the ... is
So you need some recursion. Named groups are an easy way of doing this.
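The answer's own code is missing here. As a minimal sketch of the named-group approach, assuming Perl 5.10 or later: the group names WORD and BRACKETED and the helper `top_level_tokens` are illustrative choices, not necessarily the ones the author used.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the top-level tokens of a string, where a
# token is a bare word or a balanced {...} group (possibly nested).
sub top_level_tokens {
    my ($str) = @_;
    my @tokens;
    while ($str =~ /
        \G \s* ( (?&WORD) | (?&BRACKETED) )   # one token, saved as group 1

        (?(DEFINE)
            (?<WORD>      \w+ )
            (?<BRACKETED> \{ (?: [^{}]++ | (?&BRACKETED) )* \} )
        )
    /xg) {
        push @tokens, $1;
    }
    return @tokens;
}

print "$_\n" for top_level_tokens("{abc} {{xyz}} abc");
```

Because of the \G anchor, each match must start exactly where the previous one ended, so the loop stops at the first piece of text that is neither a word nor a balanced group instead of silently skipping it.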
But instead of reinventing the wheel, why not use Text::Balanced?
You need a recursive regex. This should work:
or, if you prefer the non-loop version:
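Neither version survives in this copy. A possible shape for the two, assuming Perl 5.10 or later, is sketched below; the pattern and the helper names `matches_with_loop` and `matches_without_loop` are assumptions for illustration, not the answer's original code.

```perl
use strict;
use warnings;

# Loop version (hypothetical): one match per iteration, collected from the
# outer capture group.
sub matches_with_loop {
    my ($str) = @_;
    my @matches;
    push @matches, $1 while $str =~ /(\{(?:[^{}]++|(?R))*\}|\w+)/g;
    return @matches;
}

# Non-loop version (hypothetical): with no capture groups, a global match in
# list context returns every whole match at once.
sub matches_without_loop {
    my ($str) = @_;
    my @matches = $str =~ /\{(?:[^{}]++|(?R))*\}|\w+/g;
    return @matches;
}

print "$_\n" for matches_with_loop("{abc} {{xyz}} abc");
print "$_\n" for matches_without_loop("abc {{xyz} abc} {xyz}");
```

Both subs should return the same list for the same input; the choice is mostly about whether you want to do extra work per match inside the loop.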
To match nested brackets with just one pair at each level of nesting, but any number of levels, e.g. {1{2{3}}}, you could use
To match when there may be multiple pairs at any level of nesting, e.g. {1{2}{2}{2}}, you could use (?>\{(?:[^{}]*|(?R))*\})|\w+
The (?R) is used to match the whole pattern recursively.
To match the text contained within a pair of brackets the engine must match (?:[^{}]*|(?R))*, i.e. either [^{}]* or (?R), zero or more times (*).
So in e.g. "{abc {def}}", after the opening "{" is matched, the [^{}]* will match the "abc " and the (?R) will match the "{def}", then the closing "}" will be matched.
The "{def}" is matched because (?R) is simply short for the whole pattern (?>\{(?:[^{}]*|(?R))*\})|\w+, which as we have just seen will match a "{" followed by text matching [^{}]*, followed by "}".
Atomic grouping (?>...) is used to prevent the regex engine backtracking into bracketed text once it has been matched. This is important to ensure the regex will fail fast if it cannot find a match.
The problem of matching balanced and nested delimiters is covered in perlfaq5 and I'll leave it to them to cover all the options including (?PARNO) and Regexp::Common.
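For readers who have not met (?PARNO) before: it recurses into one numbered group rather than the whole pattern, which is handy when you only want part of the match. A minimal sketch, assuming Perl 5.10 or later, that captures just the text between each outermost pair of braces; the helper name `outer_bodies` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between each outermost pair of braces.
sub outer_bodies {
    my ($str) = @_;
    # Group 1 is the body between one pair of braces; (?1) re-enters group 1
    # by number for each nested pair.
    my @bodies = $str =~ /\{((?:[^{}]++|\{(?1)\})*)\}/g;
    return @bodies;
}

print "body: $_\n" for outer_bodies("{abc {def}} x {y}");
```

Here the outer braces are left out of the capture, while nested braces inside the body are kept as they are.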
But matching balanced items is tricky and prone to error. Unless you really want to learn and maintain advanced regexes, leave it to a module. Fortunately there is Text::Balanced to handle this and so very much more. It is the Swiss Army Chainsaw of balanced text matching.
Unfortunately it does not handle escaping on bracketed delimiters.
You can think of extract_multiple like a more generic and powerful split.
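The answer's example code is not included in this copy. A minimal sketch of how extract_multiple and extract_bracketed might be combined for the question's input follows; the helper name `balanced_fields` and the choice of a second \w+ extractor for bare words are assumptions for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);

# Hypothetical helper: splits a string into balanced {...} groups and words.
sub balanced_fields {
    my ($str) = @_;
    my @fields = extract_multiple(
        $str,
        [
            sub { extract_bracketed($_[0], '{}') },    # a balanced {...} group
            qr/\w+/,                                   # or a bare word
        ],
        undef,    # no limit on the number of fields
        1,        # drop text that no extractor recognises (here, whitespace)
    );
    return @fields;
}

print "$_\n" for balanced_fields("{abc} {{xyz}} abc");
```

The extractors are tried in order at each position, so the list reads like a small grammar for the pieces you want to keep. Passing a false value as the fourth argument would return the skipped text as fields as well, which is closer to how split behaves.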