I am running preg_replace on an HTML page. My pattern is meant to wrap certain words in a surrounding tag. However, sometimes my regular expression also modifies the HTML tags themselves. For example, when I try to replace this text:
<a href="example.com" alt="yasar home page">yasar</a>
so that yasar reads <span class="selected-word">yasar</span>, my regular expression also replaces yasar in the alt attribute of the anchor tag. The preg_replace() call I am currently using looks like this:
preg_replace("/(asf|gfd|oyws)/", '<span class=something>${1}</span>', $target);
How can I write a regular expression so that it doesn't match anything inside an HTML tag?
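A minimal reproduction of the problem described above may help when testing candidate fixes. It is only a sketch: the pattern is swapped for the word yasar from the example (an assumption, since the question's own pattern lists other words), and the variable names are illustrative.

```php
<?php
// Reproduction only: the pattern uses the word from the example text,
// not the (asf|gfd|oyws) alternation from the question.
$target = '<a href="example.com" alt="yasar home page">yasar</a>';

// A plain pattern has no notion of tags, so it matches the word
// wherever it appears, including inside the alt attribute.
$result = preg_replace('/(yasar)/', '<span class="selected-word">${1}</span>', $target);

echo $result, PHP_EOL;
```

Both occurrences of yasar get wrapped, so the alt attribute ends up containing a span element, which is the unwanted behaviour.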
This might be the kind of thing you're after: http://snipplr.com/view/3618/ In general, though, I'd advise against this approach. A better alternative is to strip out all HTML tags and rely on BBCode instead. However, I appreciate that this might not work well with what you're trying to do.
Another option may be HTML Purifier, see: http://htmlpurifier.org/
Yasar, resurrecting this question because it had another solution that wasn't mentioned.
Instead of just checking that the next tag character is an opening tag, this solution skips all <full tags>. With all the disclaimers about using regex to parse HTML, here is the regex:
Here is a demo. In code, it looks like this:
Here is an online demo of this code.
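For readers without access to the demo, here is one way the "skip full tags" idea can be sketched with PCRE's backtracking control verbs (*SKIP) and (*F). This is an illustration rather than necessarily the exact regex from this answer; the sample input and the selected-word class name are assumptions, and the word list is borrowed from the question.

```php
<?php
// Hypothetical sample input; the words come from the question's pattern.
$target = '<a href="example.com" alt="asf home page">asf</a> gfd';

// First alternative: consume a complete <...> tag, then (*SKIP)(*F) forces
// a failure and stops the engine from retrying at any position inside it.
// Remaining alternatives: the words to wrap, matched only outside tags.
$regex = '~<[^>]*>(*SKIP)(*F)|asf|gfd|oyws~';

// $0 is the whole match, which here is always one of the words.
$result = preg_replace($regex, '<span class="selected-word">$0</span>', $target);

echo $result, PHP_EOL;
```

The asf inside the alt attribute is left alone, while the asf in the link text and the trailing gfd are wrapped. The sketch assumes tags never contain a literal > inside an attribute value.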
Off the top of my head, this should work:
But I don't know how safe this would be. I am just presenting a possibility :)
You can use an assertion for that, as you just have to ensure that the searched words occur somewhere after a >, or before any <. The latter test is easier to accomplish, as lookahead assertions can be variable length:
See also http://www.regular-expressions.info/lookaround.html for a nice explanation of that assertion syntax.
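The lookahead test described above could be sketched roughly as follows. This is not necessarily the exact pattern from this answer; the sample input and the selected-word class name are assumptions, and the word list is the one from the question.

```php
<?php
// Hypothetical sample input; the words come from the question's pattern.
$target = '<a href="example.com" alt="asf home page">asf</a> gfd';

// The lookahead requires that, after the word, the next angle bracket is
// a < (or that the string ends first). Inside a tag the next bracket is
// the closing >, so the lookahead fails there.
$regex = '/(asf|gfd|oyws)(?=[^<>]*(?:<|$))/';

$result = preg_replace($regex, '<span class="selected-word">${1}</span>', $target);

echo $result, PHP_EOL;
```

This relies on < and > appearing only as tag delimiters; a literal > inside an attribute value or a bare < in the text would confuse it.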