Perl regular expression: match nested brackets

2020-02-03 07:33发布

I'm trying to match nested {} brackets with a regular expressions in Perl so that I can extract certain pieces of text from a file. This is what I have currently:

my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;

foreach (@matches) {
    print "$_\n";
}

At certain times this works as expected. For instance, if $str = "abc {{xyz} abc} {xyz}" I obtain:

abc
{{xyz} abc}
{xyz}

as expected. But for other input strings it does not function as expected. For example, if $str = "{abc} {{xyz}} abc", the output is:

{abc} {{xyz}}
abc

which is not what I expected. I would have wanted {abc} and {{xyz}} to be on separate lines, since each is balanced on its own in terms of brackets. Is there an issue with my regular expression? If so, how would I go about fixing it?

标签: regex perl
7条回答
我只想做你的唯一
2楼-- · 2020-02-03 07:50

Just modifies and extends the classic solution a bit:

(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++

Demo (This is in PCRE. The behavior is slightly different from Perl when it comes to recursive regex, but I think it should produce the same result for this case).

After some struggle (I am not familiar with Perl!), this is the demo on ideone. $& refers to the string matched by the whole regex.

my $str = "abc {{xyz} abc} {xyz} {abc} {{xyz}} abc";

while ($str =~ /(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {
    print "$&\n"
}

Note that this solution assumes that the input is valid. It will behave rather randomly on invalid input. It can be modified slightly to halt when invalid input is encountered. For that, I need more details on the input format (preferably as a grammar), such as whether abc{xyz}asd is considered valid input or not.

查看更多
别忘想泡老子
3楼-- · 2020-02-03 07:51

Wow. What a bunch of complicated answers to something that simple.

The problem you're having is that you're matching in greedy mode. That is, you are aking the regex engine to match as much as possible while making the expression true.

To avoid greedy match, just add a '?' after your quantifier. That makes the match as short as possible.

So, I changed your expression from:

my @matches = $str =~ /\{(?:\{.*\}|[^\{])*\}|\w+/sg;

To:

my @matches = $str =~ /\{(?:\{.*?\}|[^\{])*?\}|\w+/sg;

...and now it works exactly as you're expecting.

HTH

Francisco

查看更多
家丑人穷心不美
4楼-- · 2020-02-03 07:55

You were surprised how your pattern matched, but noone explained it? Here's how your pattern is matching:

my @matches = $str =~ /\{(?:\{.*\}|[^{])*\}|\w+/sg;
                       ^    ^ ^ ^  ^      ^
                       |    | | |  |      |
{ ---------------------+    | | |  |      |
a --------------------------)-)-)--+      |
b --------------------------)-)-)--+      |
c --------------------------)-)-)--+      |
} --------------------------)-)-)--+      |
  --------------------------)-)-)--+      |
{ --------------------------+ | |         |
{ ----------------------------+ |         |
x ----------------------------+ |         |
y ----------------------------+ |         |
z ----------------------------+ |         |
} ------------------------------+         |
} ----------------------------------------+

As you can see, the problem is that /\{.*\}/ matches too much. What should be in there is a something that matches

(?: \s* (?: \{ ... \} | \w+ ) )*

where the ... is

(?: \s* (?: \{ ... \} | \w+ ) )*

So you need some recursion. Named groups are an easy way of doing this.

say $1
   while /
      \G \s*+ ( (?&WORD) | (?&BRACKETED) )

      (?(DEFINE)
         (?<WORD>      \s* \w+ )
         (?<BRACKETED> \s* \{ (?&TEXT)? \s* \} )
         (?<TEXT>      (?: (?&WORD) | (?&BRACKETED) )+ )
      )
   /xg;

But instead of reinventing the wheel, why not use Text::Balanced.

查看更多
劫难
5楼-- · 2020-02-03 08:04

You need a recursive regex. This should work:

my @matches;
push @matches, $1 while $str =~ /( [^{}\s]+ | ( \{ (?: [^{}]+ | (?2) )* \} ) )/xg;

or, if you prefer the non-loop version:

my @matches = $str =~ /[^{}\s]+ | \{ (?: (?R) | [^{}]+ )+ \} /gx;
查看更多
SAY GOODBYE
6楼-- · 2020-02-03 08:04

To match nested brackets with just one pair at each level of nesting,
but any number of levels, e.g. {1{2{3}}}, you could use

/\{[^}]*[^{]*\}|\w+/g

To match when there may be multiple pairs at any level of nesting, e.g. {1{2}{2}{2}}, you could use

/(?>\{(?:[^{}]*|(?R))*\})|\w+/g

The (?R) is used to match the whole pattern recursively.

To match the text contained within a pair of brackets the engine must match (?:[^{}]*|(?R))*,
i.e. either [^{}]* or (?R), zero or more times *.

So in e.g. "{abc {def}}", after the opening "{" is matched, the [^{}]* will match the "abc " and the (?R) will match the "{def}", then the closing "}" will be matched.

The "{def}" is matched because (?R) is simply short for the whole pattern
(?>\{(?:[^{}]*|(?R))*\})|\w+, which as we have just seen will match a "{" followed by text matching [^{}]*, followed by "}".

Atomic grouping (?>...) is used to prevent the regex engine backtracking into bracketed text once it has been matched. This is important to ensure the regex will fail fast if it cannot find a match.

查看更多
该账号已被封号
7楼-- · 2020-02-03 08:05

The problem of matching balanced and nested delimiters is covered in perlfaq5 and I'll leave it to them to cover all the options including (?PARNO) and Regexp::Common.

But matching balanced items is tricky and prone to error, unless you really want to learn and maintain advanced regexes, leave it to a module. Fortunately there is Text::Balanced to handle this and so very much more. It is the Swiss Army Chainsaw of balanced text matching.

Unfortunately it does not handle escaping on bracketed delimiters.

use v5.10;
use strict;
use warnings;

use Text::Balanced qw(extract_multiple extract_bracketed);

my @strings = ("abc {{xyz} abc} {xyz}", "{abc} {{xyz}} abc");

for my $string (@strings) {
    say "Extracting from $string";

    # Extract all the fields, rather than one at a time.
    my @fields = extract_multiple(
        $string,
        [
            # Extract {...}
            sub { extract_bracketed($_[0], '{}') },
            # Also extract any other non whitespace
            qr/\S+/
        ],
        # Return all the fields
        undef,
        # Throw out anything which does not match
        1
    );

    say join "\n", @fields;
    print "\n";
}

You can think of extract_multiple like a more generic and powerful split.

查看更多
登录 后发表回答