How do I ignore HTML tags in this preg_replace? I have a foreach loop for a search, so if someone searches for "apple span", the preg_replace also wraps a span around the existing span tag and the HTML breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind. As with any rule, it does not always apply, but it's worth making up your own mind about it.
XPath allows you to search for the terms within text only, ignoring all XML elements.
Then you only need to wrap those texts in a <span> and you're done.
Edit: Finally some code ;)
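
Before the full routine, here is a minimal sketch of the simple case, where each match lies inside a single text node. It is an illustration under assumptions, not the code linked further down: the function name highlight_keyword and the wrapper id hl-root are made up for this example, and the input is assumed to be a UTF-8 HTML fragment.

```php
<?php
// Hypothetical helper: wraps each case-insensitive occurrence of $keyword that
// sits inside a text node in a span, leaving tag names and attributes untouched.
function highlight_keyword(string $html, string $keyword): string
{
    if ($keyword === '') {
        return $html;
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings for sloppy HTML quiet
    // The leading XML declaration is a common trick to make libxml read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="hl-root"]')->item(0);

    // Take a snapshot of the text nodes first, because the tree changes below.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));
    $length    = mb_strlen($keyword, 'UTF-8');

    foreach ($textNodes as $node) {
        while (($pos = mb_stripos($node->data, $keyword, 0, 'UTF-8')) !== false) {
            $match = $node->splitText($pos);     // node starting at the hit
            $rest  = $match->splitText($length); // node holding the text after the hit

            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);

            $node = $rest; // keep searching in the remainder
        }
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: only the word "span" in the text is wrapped, the tag is left alone.
echo highlight_keyword('<span class="x">apple span</span>', 'span'), "\n";
```

This does not find a term that is spread over several tags, which is what the rest of the answer deals with.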
The full routine first uses XPath to locate elements that contain the search text. My query looks like this (it might be written better, I'm not a super XPath pro):
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain text nodes which, put together, form a string that contains your search term.
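
A query in that spirit could look like the sketch below. It is not necessarily the exact query from the full code; the sample document and the variable names are assumptions for illustration. It selects the innermost elements whose combined text contains $search, and note that XPath 1.0 contains() is case-sensitive.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<doc><p>An <b>app</b>le a day</p><p>No match here</p></doc>');
$xpath = new DOMXPath($doc);

$search = 'apple';
if (strpos($search, '"') !== false) {
    // A double quote would end the XPath string literal early.
    throw new InvalidArgumentException('Search text must not contain a double quote');
}

// Elements whose string value contains the term, but none of whose child
// elements contains it on its own, i.e. the innermost containing parents.
$query = sprintf('//*[contains(., "%1$s")][not(*[contains(., "%1$s")])]', $search);
$hits  = $xpath->query($query);

foreach ($hits as $element) {
    echo $element->nodeName, ': ', $element->textContent, "\n";
}
```

For this sample only the first p element matches, even though the term is split between the p and the nested b.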
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful for doing string operations on a list of text nodes as if they were one string.
This is the base skeleton of the routine:
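
To illustrate the idea of treating several text nodes as one string, here is a rough sketch. It uses a hypothetical TextNodeRange class written for this example, which is not the TextRange class from the linked code, and it handles only the first occurrence of the term.

```php
<?php
// Hypothetical stand-in for a text-range helper over a list of DOMText nodes.
final class TextNodeRange
{
    /** @var DOMText[] */
    private $nodes;

    public function __construct(array $nodes)
    {
        $this->nodes = array_values($nodes);
    }

    public static function fromElement(DOMElement $element): self
    {
        $xpath = new DOMXPath($element->ownerDocument);
        return new self(iterator_to_array($xpath->query('.//text()', $element)));
    }

    /** All text nodes joined in document order. */
    public function getString(): string
    {
        $joined = '';
        foreach ($this->nodes as $node) {
            $joined .= $node->data;
        }
        return $joined;
    }

    /**
     * Returns text nodes that exactly cover $length characters starting at
     * character $offset of the joined string, splitting boundary nodes.
     * The range is stale afterwards and should be rebuilt before reuse.
     */
    public function slice(int $offset, int $length): array
    {
        $covered = [];
        $end     = $offset + $length;
        $pos     = 0; // character offset of the current node in the joined string

        foreach ($this->nodes as $node) {
            $len       = mb_strlen($node->data, 'UTF-8');
            $nodeStart = $pos;
            $nodeEnd   = $pos + $len;
            $pos       = $nodeEnd;

            if ($nodeEnd <= $offset || $nodeStart >= $end) {
                continue; // node lies completely outside the requested range
            }
            $from = max($offset, $nodeStart) - $nodeStart;
            $to   = min($end, $nodeEnd) - $nodeStart;

            $part = $node;
            if ($to < $len) {
                $part->splitText($to);           // cut off the tail after the range
            }
            if ($from > 0) {
                $part = $part->splitText($from); // drop the head before the range
            }
            $covered[] = $part;
        }
        return $covered;
    }
}

// Usage: highlight "apple" although it is spread over two text nodes.
$doc = new DOMDocument();
$doc->loadXML('<p>An <b>app</b>le a day</p>');
$range = TextNodeRange::fromElement($doc->documentElement);
$start = mb_strpos($range->getString(), 'apple', 0, 'UTF-8');

foreach ($range->slice($start, mb_strlen('apple', 'UTF-8')) as $text) {
    $span = $doc->createElement('span');
    $span->setAttribute('class', 'search_hightlight');
    $text->parentNode->replaceChild($span, $text);
    $span->appendChild($text);
}
$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
```

Each covered piece gets its own span, so the existing nesting of the document stays valid: one span ends up inside the b element and one directly after it.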
For my example XML:
It produces the following result:
This shows that the approach even finds text that is distributed across multiple tags. That is not easily possible with regular expressions.
You find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answer's example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting text nodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it only uses US-ASCII, which has the same offsets as UTF-8 for the example data.
For a real-life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
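
A small sketch of the difference, with a made-up sample string that has one non-ASCII character before the match (the variable names are assumptions for illustration):

```php
<?php
$doc = new DOMDocument();
// "\u{00C4}" is the letter A with diaeresis, two bytes long in UTF-8.
$doc->loadXML("<p>\u{00C4}pfel und apple</p>");
$text   = $doc->documentElement->firstChild; // the DOMText node inside the p
$search = 'apple';

$byteOffset = strpos($text->data, $search);                // 11, counts bytes
$charOffset = mb_strpos($text->data, $search, 0, 'UTF-8'); // 10, counts characters

// DOMText::splitText() expects the character offset.
$match = $text->splitText($charOffset);
echo $match->data, "\n";
```

Using $byteOffset for the split would cut one character too late here, so the new node would start in the middle of the search term.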