Remove unnecessary close tags using regex

Posted 2020-04-30 18:01

Question:

I'm looking for a regex that removes closing tags, and everything else, up to the first opening tag. For example:

</xy>..</zz>..<a>... -> <a>...

</b>..</cc>..<a href="#">...</a> -> <a href="#">...</a>

I tried this, but it doesn't work for some reason:

$html = preg_replace("/^.*<.*>/","<.*>",$html);

Answer 1:

The regex below captures all the text before an opening tag into group 1 and the rest of the string, starting at the opening tag, into group 2.

(.*)(<\w.*)


Your PHP code would be:

<?php
$re = '~(.*)(<\w.*)~';
$str = '</b>..</cc>..<a href="#">...</a> -> <a href="#">...</a>';
$replacement = '$2'; // keep only group 2 (the opening tag and what follows)
echo preg_replace($re, $replacement, $str);
// => <a href="#">...</a>
?>

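OR

<?php
// Same idea with a non-capturing group, so the kept text is group 1.
$re = '~(?:.*)(<\w.*)~';
// Note: inside single quotes, \n is a literal backslash and "n", not a newline.
$str = '</p>\n<p>Â </p>';
$replacement = '$1';
echo preg_replace($re, $replacement, $str);
// => <p>Â </p>
?>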

Explanation:

  • (.*)(<\w.*) matches from the beginning of the line. Because .* is greedy, group 1 extends up to the last < that is followed by a word character (\w) on that line; use (.*?) instead if group 1 should stop at the first one. The text before that <\w is stored in group 1, and the text from <\w onward (including <\w itself) is stored in group 2.
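To make the greedy versus lazy difference concrete, here is a minimal sketch. The input string is a hypothetical example (not from the question) chosen so that one line contains two opening tags.

```php
<?php
// Hypothetical input with two opening tags on one line.
$str = '</b>..<i>x</i>..<a href="#">y</a>';

// Greedy: group 1 runs up to the LAST "<" followed by a word character.
$greedy = preg_replace('~(.*)(<\w.*)~', '$2', $str);

// Lazy: group 1 stops at the FIRST "<" followed by a word character.
$lazy = preg_replace('~(.*?)(<\w.*)~', '$2', $str);

echo $greedy, "\n"; // <a href="#">y</a>
echo $lazy, "\n";   // <i>x</i>..<a href="#">y</a>
```

For the single-tag examples in the question both variants give the same result; they only differ when a line holds more than one opening tag.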


Answer 2:

If I understand your responses to Avinash Raj's answer correctly, you need something that matches any number of lines of input up to the first open tag, but that matches only once, so that all subsequent content is kept.

.*(\n.*?)*?(<\w.*(\n.*)*)

The first part

.*(\n.*?)*?

matches any number of lines, but not greedily (hence the ?s), so it stops at the first line that contains an open tag:

<\w

This is then followed once again by any number of lines of anything:

.*(\n.*)*

So, to extract what you want, you would replace

.*(\n.*?)*?(<\w.*(\n.*)*)

with

\2

which is everything from the first open tag onward, including the tag itself.
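The replacement described above could be applied in PHP roughly as follows. This is a sketch under assumptions: the multi-line input is a made-up example, and the second pattern (a lazy match with a lookahead and the s flag) is an alternative added here for comparison, not part of the answer.

```php
<?php
// Hypothetical multi-line input: closing tags first, then two opening tags.
$html = "</xy>..\n</zz>..\n<a>first</a>\n<b>second</b>";

// Pattern from the answer; group 2 is everything from the matched "<\w" on.
$re = '~.*(\n.*?)*?(<\w.*(\n.*)*)~';
$fromAnswer = preg_replace($re, '\2', $html, 1); // limit 1: replace only once

// Assumed alternative: with the "s" flag "." also matches newlines, so a lazy
// ".*?" anchored at the start (\A) can stop right before the first "<\w".
$withLookahead = preg_replace('~\A.*?(?=<\w)~s', '', $html);

echo $fromAnswer, "\n";
echo $withLookahead, "\n";
// Both print:
// <a>first</a>
// <b>second</b>
```

One difference worth knowing: the leading .* in the answer's pattern is greedy, so if the very first line already contains an open tag and a later line contains one too, the match skips ahead to the later line (for "<a>x\n<b>y" it yields "<b>y"). The lookahead variant leaves such input unchanged.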