As the title says, here is an example input:
(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)
Of course, the matched strings will be processed by recursion.
I want the first recursion to match:
[
(outer
(center
(inner)
(inner)
center)
ouer),
(outer
(inner)
ouer),
(outer
ouer)]
And the after processes is needless to say...
Many regex implementations will not allow you to match an arbitrary amount of nesting. However, Perl, PHP and .NET support recursive patterns.
A demo in Perl:
#!/usr/bin/perl -w
my $text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';
while($text =~ /(\(([^()]|(?R))*\))/g) {
print("----------\n$1\n");
}
which will print:
----------
(outer
(center
(inner)
(inner)
center)
ouer)
----------
(outer
(inner)
ouer)
----------
(outer
ouer)
Or, the PHP equivalent:
$text = '(outer
(center
(inner)
(inner)
center)
ouer)
(outer
(inner)
ouer)
(outer
ouer)';
preg_match_all('/(\(([^()]|(?R))*\))/', $text, $matches);
print_r($matches);
which produces:
Array
(
[0] => Array
(
[0] => (outer
(center
(inner)
(inner)
center)
ouer)
[1] => (outer
(inner)
ouer)
[2] => (outer
ouer)
)
...
An explanation:
( # start group 1
\( # match a literal '('
( # group 2
[^()] # any char other than '(' and ')'
| # OR
(?R) # recursively match the entir pattern
)* # end group 2 and repeat zero or more times
\) # match a literal ')'
) # end group 1
EDIT
Note @Goozak's comment:
A better pattern might be \(((?>[^()]+)|(?R))*\)
(from PHP:Recursive patterns). For my data, Bart's pattern was crashing PHP when it encountered a (long string) without nesting. This pattern went through all my data without problem.
Don't use regex.
Instead, a simple recursive function will suffice:
def recursive_bracket_parser(s, i):
while i < len(s):
if s[i] == '(':
i = recursive_bracket_parser(s, i+1)
elif s[i] == ')':
return i+1
else:
# process whatever is at s[i]
i += 1
return i
I found this simple regex which extracts all nested balanced groups using recursion, although the resulting solution is not quite straightforward as you may expect:
Regex pattern: (1(?:\1??[^1]*?2))+
Sample input: 1ab1cd1ef2221ab1cd1ef222
For simplicity I put 1
for open and 2
for closed bracket. Alpha characters represent some inner data.
I'll rewrite input so that it can be easy to explain.
1 ab 1 cd 1ef2 2 2 1 ab 1 cd 1ef2 2 2
|_1|
|______2__|
|_____________3_____|
In first iteration regex will match the most inner subgroup 1ef2
of in first sibling group 1ab1cd1ef222
. If we remember it and it's position, and remove this group, there would remain 1ab1cd22
. If we continue with regex, it would return 1cd2
, and finally 1ab2
. Then, it will continue to parse second sibling group the same way.
As we see from this example, regex will properly extract substrings as they appear in the hierarchy defined by brackets. Position of particularly substring in hierarchy will be determined during second iteration, if it's position in string is between substring from second iteration, then it is a child node, else it's a sibling node.
From our example:
1ab1cd1ef222 1ab1cd1ef222
, iteration match 1ef2
, with index 6
,
1ab1cd22 1ab1cd1ef222
, iteration match 1cd2
, with index 3
, ending with 6
.
Because 3
< 6
<= 6
, first substring is child of the second substring.
1ab2 1ab1cd1ef222
, iteration match 1ab2
, with index 0
, ending with 3
.
Because 0
< 3
<= 3
, first substring is child of the second substring.
1ab1cd1ef222
, iteration match 1ef2
, with index 6
,
Because it's not 3
< 0
<= 6
, it's branch from another sibling, etc...
We must iterate and remove all siblings before we can move to the parent. Thus, we must remember all that siblings in the order they appear in iteration.
Delphi Pascal code based on posting above from nneonneo:
You need a form with a button on it, named btnRun. In the source code, replace "arnolduss" with your name in the DownLoads folder. Note the stack level in the output created by ParseList. Obviously brackets of the same type must open and close on the same stack level. You will now be able to extract the so-called groups per stack level.
unit Unit1;
interface
uses
Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
Dialogs, StdCtrls;
Type TCharPos = record
Level: integer;
Pos: integer;
Char: string;
end;
type
TForm1 = class(TForm)
btnRun: TButton;
procedure btnRunClick(Sender: TObject);
private
{ Private declarations }
procedure FormulaFunctionParser(var CharPos: array of TCharPos;
const formula, LBracket, RBracket: string; var itr, iChar,
iLevel: integer; var ParseList: TStringList);
public
{ Public declarations }
end;
var
Form1: TForm1;
implementation
{$R *.dfm}
procedure TForm1.btnRunClick(Sender: TObject);
var
formula: string;
CharPos: array of TCharPos;
itr, iChar, iLevel: integer;
ParseList: TStringList;
begin
screen.Cursor := crHourGlass;
itr := 0;
iChar := 1;
iLevel := 0;
ParseList := TStringList.Create;
try
formula := Trim('add(mul(a,add(b,c)),d) + e - sub(f,g)');
ParseList.Add(formula);
ParseList.Add('1234567890123456789012345678901234567890');
SetLength(CharPos, Length(formula));
FormulaFunctionParser(CharPos[0], formula, '(', ')', itr, iChar, iLevel, ParseList);
finally
ParseList.SaveToFile('C:\Users\arnolduss\Downloads\ParseList.txt');
FreeAndNil(ParseList);
screen.Cursor := crDefault;
end;
end;
//Based on code from nneonneo in: https://stackoverflow.com/questions/14952113/how-can-i-match-nested-brackets-using-regex
procedure TForm1.FormulaFunctionParser(var CharPos: array of TCharPos;
const formula, LBracket, RBracket: string; var itr, iChar,
iLevel: integer; var ParseList: TStringList);
procedure UpdateCharPos;
begin
CharPos[itr-1].Level := iLevel;
CharPos[itr-1].Pos := itr;
CharPos[itr-1].Char := formula[iChar];
ParseList.Add(IntToStr(iLevel) + ' ' + intToStr(iChar) + ' = ' + formula[iChar]);
end;
begin
while iChar <= length(formula) do
begin
inc(itr);
if formula[iChar] = LBracket then
begin
inc(iLevel);
UpdateCharPos;
inc(iChar);
FormulaFunctionParser(CharPos, formula, LBracket, RBracket, itr, iChar, iLevel, ParseList);
end
else
if formula[iChar] = RBracket then
begin
UpdateCharPos;
dec(iLevel);
inc(iChar);
end
else
begin
UpdateCharPos;
inc(iChar);
end;
end;
end;
end.