I am trying to write a regular expression in PHP that matches a specific pattern containing itself, nested indefinitely. I know that by default regular expressions are not capable of doing that, but PHP's Recursive Patterns (http://php.net/manual/de/regexp.reference.recursive.php) should make it possible.
I have nested structures like this:
<a=5>
<a=3>
Foo
<b>Bar</b>
</a>
Baz
</a>
Now I want to match the content of the outermost tag. In order to correctly match up the first opening tag with the last closing tag, I need PHP's recursion item (?R).
I tried a pattern like this:
/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s
Which basically means <a=5>, followed by as many as possible of the following, followed by </a>:
- another tag (recursively)
- any character that does not open a tag
- a "<", followed by an optional slash, followed by a character other than "a"
- the same as the previous case, but WITH an "a" that does not end the tag name (it is followed by at least 1 more name character)
The last 2 cases could be just one case [tag not named "a"], but I heard this should be avoided in regular expressions, because it needs lookarounds and would have bad performance.
However, I see no mistake in my RegEx, but it does not match the given string. I want the following match:
<a=3>
Foo
<b>Bar</b>
</a>
Baz
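A minimal script that reproduces the problem. The variable names are only for illustration, and the exact whitespace of the input is an assumption:

```php
<?php
// Sample input from above, written as one string.
$input = "<a=5>\n<a=3>\nFoo\n<b>Bar</b>\n</a>\nBaz\n</a>";

// The pattern I tried.
$pattern = '/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s';

// preg_match() returns 1 on a match, 0 on no match, false on error.
$result = preg_match($pattern, $input, $m);
var_dump($result); // int(0): the pattern compiles, but nothing matches
```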
Here's a link to play around with the RegEx: https://www.regex101.com/r/lO1wA6/1
You can use this regex to match what you want (the regex is placed in a string literal for the sake of convenience):
'~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~'
Here is a breakdown of the regex above:
<a=5>
(
<([a-zA-Z0-9]+)[^>]*>
(?1)*
</\2>
|
[^<>]++
)*
</a>
The first part, <([a-zA-Z0-9]+)[^>]*>(?1)*</\2>, matches a pair of matching tags and all of its content. It assumes that the name of the tag consists of the characters [a-zA-Z0-9]. The name of the tag is captured by ([a-zA-Z0-9]+) and backreferenced when matching the closing tag </\2>.
The second part, [^<>]++, matches whatever else is outside the tags. Note that there is no handling of quoted strings, so depending on your input it may not work.
Back to the routine call (?1), which recursively calls the first capturing group. Notice that a tag can contain 0 or more instances of other tags or non-tag content. Due to the way the regex is written, this property is also shared by the outermost <a=5>...</a> pair.
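The pattern can be tried out with preg_match. What follows is a minimal sketch under a few assumptions: the variable names and the indentation of the sample input are made up for illustration, and the pattern is wrapped in one extra capturing group that is not part of the regex above, so that the content of the outermost tag can be read back directly. That extra group shifts the numbering, so (?1) becomes (?2) and \2 becomes \3.

```php
<?php
// Sample input from the question (the indentation is an assumption).
$input = "<a=5>\n  <a=3>\n    Foo\n    <b>Bar</b>\n  </a>\n  Baz\n</a>";

// Same pattern as above, wrapped in one extra group (group 1).
// Group 2 is now the recursive "tag pair or text" group, and
// group 3 is the captured tag name used by the backreference.
$pattern = '~<a=5>((<([a-zA-Z0-9]+)[^>]*>(?2)*</\3>|[^<>]++)*)</a>~';

if (preg_match($pattern, $input, $m) === 1) {
    // $m[1] holds everything between <a=5> and its matching </a>.
    echo $m[1];
} else {
    echo "no match\n";
}
```

Without the extra group, group 1 is overwritten on every repetition of (...)*, so it would only hold the last piece that was matched rather than the whole content.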
Demo on regex101
Try this:
PHP
$re = "/(<[^\\/>]+(\\/?)>)*([^<]+)(<\\/\\w+>)*/m";
$str = "<a=5>\n <a=3>\n Foo\n <b/>Bar</b>\n </a>\n Baz\n</a>";
preg_match_all($re, $str, $matches);
var_dump($matches);
// The capture groups hold:
$matches[1]; // opening tags
$matches[2]; // the "/" marker of self-closing tags such as <b/>
$matches[3]; // inner text
$matches[4]; // closing tags
output
array (size=5)
0 =>
array (size=5)
0 => string '<a=5>
' (length=7)
1 => string '<a=3>
Foo
' (length=12)
2 => string '<b/>Bar</b>' (length=11)
3 => string '
</a>' (length=6)
4 => string '
Baz
</a>' (length=10)
1 =>
array (size=5)
0 => string '<a=5>' (length=5)
1 => string '<a=3>' (length=5)
2 => string '<b/>' (length=4)
3 => string '' (length=0)
4 => string '' (length=0)
2 =>
array (size=5)
0 => string '' (length=0)
1 => string '' (length=0)
2 => string '/' (length=1)
3 => string '' (length=0)
4 => string '' (length=0)
3 =>
array (size=5)
0 => string '
' (length=2)
1 => string '
Foo
' (length=7)
2 => string 'Bar' (length=3)
3 => string '
' (length=2)
4 => string '
Baz
' (length=6)
4 =>
array (size=5)
0 => string '' (length=0)
1 => string '' (length=0)
2 => string '</b>' (length=4)
3 => string '</a>' (length=4)
4 => string '</a>' (length=4)
Live Demo
OR
$re = "/(<[^\\/>]+\\/?>)*([^<]+)(<\\/\\w+>)*/m";
$str = "<a=5>fff\n <a=3>\n Foo\n <b/>Bar</b>\n </a>\n Baz\n</a>";
preg_match_all($re, $str, $matches);
//var_dump($matches);
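// Rebuild the string from the captured pieces:
// [1] = opening tag, [2] = inner text, [3] = closing tag.
$md = "";
$c = count($matches[1]);
foreach ($matches[1] as $k => $v) {
    if ($k != 0) {
        $md .= $v . $matches[2][$k] . $matches[3][$k];
    } else if ($c != $k + 1) {
        // First match: leave out the outermost opening tag.
        $md .= $matches[2][$k] . $matches[3][$k];
    }
}
var_dump($md);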
Live
Output
string 'fff
<a=3>
Foo
<b/>Bar</b>
</a>
Baz
</a>' (length=44)