How can I match nested brackets using regex?

2020-01-22 14:08发布

问题:

As the title says, here is an example input:

 (outer
   (center
     (inner)
     (inner)
   center)
 ouer)
 (outer
   (inner)
 ouer)
 (outer
 ouer)

Of course, the matched strings will be processed by recursion.

I want the first recursion to match:

 [
 (outer
   (center
     (inner)
     (inner)
   center)
 ouer),
 (outer
   (inner)
 ouer),
 (outer
 ouer)]

And the after processes is needless to say...

回答1:

Many regex implementations will not allow you to match an arbitrary amount of nesting. However, Perl, PHP and .NET support recursive patterns.

A demo in Perl:

#!/usr/bin/perl -w

my $text = '(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
 (outer
   (inner)
 ouer)
 (outer
 ouer)';

while($text =~ /(\(([^()]|(?R))*\))/g) {
  print("----------\n$1\n");
}

which will print:

----------
(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
----------
(outer
   (inner)
 ouer)
----------
(outer
 ouer)

Or, the PHP equivalent:

$text = '(outer
   (center
     (inner)
     (inner)
   center)
 ouer)
 (outer
   (inner)
 ouer)
 (outer
 ouer)';

preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);

print_r($matches);

which produces:

Array
(
    [0] => Array
        (
            [0] => (outer
   (center
     (inner)
     (inner)
   center)
 ouer)
            [1] => (outer
   (inner)
 ouer)
            [2] => (outer
 ouer)
        )

...

An explanation:

(          # start group 1
  \(       #   match a literal '('
  (        #   group 2
    [^()]  #     any char other than '(' and ')'
    |      #     OR
    (?R)   #     recursively match the entir pattern
  )*       #   end group 2 and repeat zero or more times
  \)       #   match a literal ')'
)          # end group 1

EDIT

Note @Goozak's comment:

A better pattern might be \(((?>[^()]+)|(?R))*\) (from PHP:Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.



回答2:

Don't use regex.

Instead, a simple recursive function will suffice:

def recursive_bracket_parser(s, i):
    while i < len(s):
        if s[i] == '(':
            i = recursive_bracket_parser(s, i+1)
        elif s[i] == ')':
            return i+1
        else:
            # process whatever is at s[i]
            i += 1
    return i


回答3:

I found this simple regex which extracts all nested balanced groups using recursion, although the resulting solution is not quite straightforward as you may expect:

Regex pattern: (1(?:\1??[^1]*?2))+

Sample input: 1ab1cd1ef2221ab1cd1ef222

For simplicity I put 1 for open and 2 for closed bracket. Alpha characters represent some inner data. I'll rewrite input so that it can be easy to explain.

1  ab  1 cd 1ef2 2  2     1  ab  1 cd 1ef2 2  2

            |_1|
       |______2__|
|_____________3_____|

In first iteration regex will match the most inner subgroup 1ef2 of in first sibling group 1ab1cd1ef222. If we remember it and it's position, and remove this group, there would remain 1ab1cd22. If we continue with regex, it would return 1cd2, and finally 1ab2. Then, it will continue to parse second sibling group the same way.

As we see from this example, regex will properly extract substrings as they appear in the hierarchy defined by brackets. Position of particularly substring in hierarchy will be determined during second iteration, if it's position in string is between substring from second iteration, then it is a child node, else it's a sibling node.

From our example:

  1. 1ab1cd1ef222 1ab1cd1ef222, iteration match 1ef2, with index 6,

  2. 1ab1cd22 1ab1cd1ef222, iteration match 1cd2, with index 3, ending with 6. Because 3 < 6 <= 6, first substring is child of the second substring.

  3. 1ab2 1ab1cd1ef222, iteration match 1ab2, with index 0, ending with 3. Because 0 < 3 <= 3, first substring is child of the second substring.

  4. 1ab1cd1ef222, iteration match 1ef2, with index 6, Because it's not 3 < 0 <= 6, it's branch from another sibling, etc...

We must iterate and remove all siblings before we can move to the parent. Thus, we must remember all that siblings in the order they appear in iteration.



回答4:

Delphi Pascal code based on posting above from nneonneo:

You need a form with a button on it, named btnRun. In the source code, replace "arnolduss" with your name in the DownLoads folder. Note the stack level in the output created by ParseList. Obviously brackets of the same type must open and close on the same stack level. You will now be able to extract the so-called groups per stack level.

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls;

Type TCharPos = record
  Level: integer;
  Pos: integer;
  Char: string;
end;

type
  TForm1 = class(TForm)
    btnRun: TButton;
    procedure btnRunClick(Sender: TObject);
  private
    { Private declarations }
    procedure FormulaFunctionParser(var CharPos: array of TCharPos;
       const formula, LBracket, RBracket: string; var itr, iChar,
       iLevel: integer; var ParseList: TStringList);
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.btnRunClick(Sender: TObject);
var
  formula: string;
  CharPos: array of TCharPos;
  itr, iChar, iLevel: integer;
  ParseList: TStringList;
begin
  screen.Cursor := crHourGlass;
  itr := 0;
  iChar := 1;
  iLevel := 0;
  ParseList := TStringList.Create;
  try
    formula := Trim('add(mul(a,add(b,c)),d) + e - sub(f,g)');
    ParseList.Add(formula);
    ParseList.Add('1234567890123456789012345678901234567890');

    SetLength(CharPos, Length(formula));
    FormulaFunctionParser(CharPos[0], formula, '(', ')', itr, iChar, iLevel, ParseList);
  finally
    ParseList.SaveToFile('C:\Users\arnolduss\Downloads\ParseList.txt');
    FreeAndNil(ParseList);
    screen.Cursor := crDefault;
  end;
end;

//Based on code from nneonneo in: https://stackoverflow.com/questions/14952113/how-can-i-match-nested-brackets-using-regex
procedure TForm1.FormulaFunctionParser(var CharPos: array of TCharPos;
       const formula, LBracket, RBracket: string; var itr, iChar,
       iLevel: integer; var ParseList: TStringList);
  procedure UpdateCharPos;
  begin
    CharPos[itr-1].Level := iLevel;
    CharPos[itr-1].Pos := itr;
    CharPos[itr-1].Char := formula[iChar];

    ParseList.Add(IntToStr(iLevel) + '  ' + intToStr(iChar) + ' = ' + formula[iChar]);
  end;
begin
  while iChar <= length(formula) do
  begin
    inc(itr);
    if formula[iChar] = LBracket then
    begin
      inc(iLevel);
      UpdateCharPos;
      inc(iChar);
      FormulaFunctionParser(CharPos, formula, LBracket, RBracket, itr, iChar, iLevel, ParseList);
    end
    else
    if formula[iChar] = RBracket then
    begin
      UpdateCharPos;
      dec(iLevel);
      inc(iChar);
    end
    else
    begin
      UpdateCharPos;
      inc(iChar);
    end;
  end;
end;

end.


标签: regex nested