I am running preg_replace on an HTML page. My pattern is meant to wrap certain words in a surrounding tag. However, my regular expression sometimes modifies the HTML tags themselves. For example, when I try to replace this text:
<a href="example.com" alt="yasar home page">yasar</a>
so that yasar reads <span class="selected-word">yasar</span>, my regular expression also replaces yasar in the alt attribute of the anchor tag. The preg_replace() call I am currently using looks like this:
preg_replace("/(asf|gfd|oyws)/", '<span class=something>${1}</span>',$target);
How can I write a regular expression that doesn't match anything inside an HTML tag?
You can use an assertion for that: you just have to ensure that the searched words occur somewhere after a >, or before any <. The latter test is easier to accomplish, as lookahead assertions can be variable length:
/(asf|foo|barr)(?=[^>]*(<|$))/
See also http://www.regular-expressions.info/lookaround.html for a nice explanation of that assertion syntax.
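To see the lookahead at work, here is a minimal sketch that applies it to the anchor from the question. The word list is reduced to yasar and the variable names are chosen only for illustration; the class name is the one used in the question.

```php
<?php
// Sample input taken from the question.
$target = '<a href="example.com" alt="yasar home page">yasar</a>';

// Match "yasar" only if no ">" occurs between it and the next "<"
// (or the end of the input), i.e. only when it is not inside a tag.
$pattern = '/(yasar)(?=[^>]*(<|$))/';

$result = preg_replace($pattern, '<span class="selected-word">$1</span>', $target);

echo $result, PHP_EOL;
// Prints: <a href="example.com" alt="yasar home page"><span class="selected-word">yasar</span></a>
```

The occurrence inside the alt attribute is left alone because a > follows it before any <, so the lookahead fails there.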
Yasar, resurrecting this question because it had another solution that wasn't mentioned.
Instead of just checking that the next tag character is an opening tag, this solution skips all <full tags>.
With all the disclaimers about using regex to parse html, here is the regex:
<[^>]*>(*SKIP)(*F)|word1|word2|word3
Here is a demo. In code, it looks like this:
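$target = "word1 <a skip this word2 >word2 again</a> word3";
// Match and discard whole tags via (*SKIP)(*F); otherwise match the words.
$regex = "~<[^>]*>(*SKIP)(*F)|word1|word2|word3~";
// \0 refers to the entire match.
$repl = '<span class="">\0</span>';
$new = preg_replace($regex, $repl, $target);
// Escape the result so the inserted tags are visible in a browser.
echo htmlentities($new);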
Here is an online demo of this code.
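One practical difference between the two patterns shows up when a literal > appears in the text itself. The sketch below uses a made-up input string and square brackets as a stand-in replacement, both assumptions for illustration only. Neither pattern copes with a > inside an attribute value.

```php
<?php
// Hypothetical input: a literal ">" appears in the text right after "word1".
$target = 'word1 > word2 <a title="word3">word3</a>';

// Lookahead approach: the word must be followed by "<" (or the end) before any ">".
$lookahead = '~(word1|word2|word3)(?=[^>]*(<|$))~';

// (*SKIP)(*F) approach: consume whole tags and discard them, then match the words.
$skipFail = '~<[^>]*>(*SKIP)(*F)|word1|word2|word3~';

$viaLookahead = preg_replace($lookahead, '[$1]', $target);
$viaSkipFail  = preg_replace($skipFail, '[$0]', $target);

echo $viaLookahead, PHP_EOL; // word1 > [word2] <a title="word3">[word3]</a>
echo $viaSkipFail, PHP_EOL;  // [word1] > [word2] <a title="word3">[word3]</a>
```

The lookahead version leaves word1 untouched because the stray > makes it look as if the word sits inside a tag, while the (*SKIP)(*F) version only excludes text between a real < and the next >.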
Reference
- How to match pattern except in situations s1, s2, s3
- How to match a pattern unless...
This might be the kind of thing that you're after: http://snipplr.com/view/3618/
In general, I'd advise against doing this. A better alternative is to strip out all HTML tags and rely on BBCode instead, such as:
[b]bold text[/b] [i]italic text[/i]
However, I appreciate that this might not work well with what you're trying to do.
Another option may be HTML Purifier, see: http://htmlpurifier.org/
Off the top of my head, this should work:
echo preg_replace("/<(.*)>(.*)<\/(.*)>/i","<$1><span class=\"some-class\">$2</span></$3>",$target);
But I don't know how safe this would be. I am just presenting a possibility :)