I was wondering which of the two methods mentioned in the title is more efficient for replacing content in an HTML page.
I have this custom tag in my page: <includes module='footer'/>
which will be replaced with some content.
Now there are some downsides to using DOMDocument->getElementsByTagName('includes')->item(0)->parentNode->replaceChild. For instance, when I forget to add the slash in the tag, like so: <includes module='footer'>, the whole site crashes.
Regex tolerates cases like these, as long as the input matches the pattern. It would even allow me to replace any string, like {includes:footer}.
Now back to my actual question: are there any downsides to using regex for this purpose, like performance issues?
More here: Append child/element in head using XML Manipulation
cheers
So I did some naive performance testing using microtime(true), and it turns out that preg_replace is the faster option. While DOM replaceChild needed between 2.0 and 3.5 ms, preg_replace needed between 0.5 and 1.2 ms. But I guess that's only true in my case.
This is what my HTML looks like:
This is the regex I used:
/{([ ]*)includes:([ ]*)$key([^}]*)}/i
As I said, I'm not fully proficient with regex, but this did the job. I guess it would run even faster if it were optimized.
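For illustration, here is a minimal sketch of a tightened version of that pattern. The helper name replacePlaceholder is hypothetical, and the pattern is deliberately stricter than the one above: it drops the trailing ([^}]*), which would also let the key "foot" match {includes:footer}. It assumes PHP 7.4 or later for the arrow function.

```php
<?php
// Hypothetical helper: replaces every {includes:KEY} placeholder with $content.
function replacePlaceholder(string $html, string $key, string $content): string
{
    // preg_quote() escapes regex metacharacters in the key; '/' is the delimiter.
    $pattern = '/\{\s*includes:\s*' . preg_quote($key, '/') . '\s*\}/i';

    // A callback keeps "$1" or "\1" inside $content from being read as backreferences.
    $result = preg_replace_callback(
        $pattern,
        static fn (array $match): string => $content,
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$html = '<div>{includes:footer}</div><p>{ includes: Footer }</p>';
echo replacePlaceholder($html, 'footer', '<footer>(c) Example</footer>');
```

Both placeholders are replaced here, since the pattern tolerates whitespace and is case-insensitive, as the original one is.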
For the replaceChild method I used a custom tag like this:
<includes module='body'/>
Again, this was tested on my local server, so I still need to run some tests to see how it behaves on my online server.
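A rough sketch of how such a comparison could be set up is below. It is not the exact script behind the numbers above: the helper averageMs, the sample markup, and the run count are all assumptions, and the absolute timings will differ from machine to machine.

```php
<?php
// Hypothetical timing helper: average milliseconds per call over $runs calls.
function averageMs(callable $fn, int $runs = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000 / $runs;
}

// Sample inputs (assumed): one placeholder per document.
$content = '<footer>(c) Example</footer>';
$tpl = '<html><body><p>Hello</p>{includes:footer}</body></html>';
$xml = "<html><body><p>Hello</p><includes module='footer'/></body></html>";

$viaRegex = function () use ($tpl, $content): string {
    return preg_replace('/\{\s*includes:\s*footer\s*\}/i', $content, $tpl);
};

$viaDom = function () use ($xml, $content): string {
    $doc = new DOMDocument();
    $doc->loadXML($xml); // parsing is included in the measured time
    $node = $doc->getElementsByTagName('includes')->item(0);
    $fragment = $doc->createDocumentFragment();
    $fragment->appendXML($content);
    $node->parentNode->replaceChild($fragment, $node);
    return $doc->saveXML($doc->documentElement);
};

printf("preg_replace: %.4f ms\n", averageMs($viaRegex));
printf("replaceChild: %.4f ms\n", averageMs($viaDom));
```

Averaging over many runs smooths out some of the noise of a single microtime(true) measurement, but the result still depends on document size and on how the pattern is written.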
I wouldn't be too worried about performance here; I would consider them "comparable". Benchmarks would need to be run to truly determine this, as it would depend on the size of the document and how the regular expression is written.
Instead, I would be concerned about accuracy. In general, DOMDocument will be much better at parsing XML, since it was built to read and understand the language. However, it does fail on <includes module='footer'> because it is an unclosed tag (expecting: </includes>).
Most common HTML/XML formatting issues can be fixed with PHP's Tidy class. I would check this out, since you should get much more predictable results than if you used regex to parse. If you used a regular expression, there could technically be attributes before/after the module attribute, elements within the includes element, unexpected characters like <includes module='foo>bar'>, etc.
In the end, if your XML is in a "controlled" environment (i.e. you know what can and can't happen, you know what possible characters module will contain, you know that it will always be a self-closing element containing no children, etc.), then by all means use a regular expression. Just know that it is looking for a very specific set of rules. However, if you expect this to work with "anything you throw at it", please use a DOM parser (after Tidy'ing to avoid the exceptions), regardless of performance (although I bet it will be very comparable in many instances).
Also, a final note: if you plan to find/replace/manipulate many nodes in a document, you will see a large performance increase by going with a DOM parser. A DOM parser takes a document and parses it once. Then you just traverse the data it has already loaded into its class. This is compared to using regular expressions, where each individual one will be run across the whole document looking for a set of matches.
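To illustrate the "parse once, then traverse" point, here is a minimal sketch that replaces every includes element in a single pass. The helper replaceIncludes and the $modules map (module name to XML snippet) are assumptions made for the example, and the input is assumed to be well-formed XML.

```php
<?php
// Hypothetical helper: parse once, then replace every <includes> element.
// $modules maps a module name to a well-formed XML snippet (an assumption).
function replaceIncludes(string $xml, array $modules): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml); // the document is parsed exactly once

    // getElementsByTagName() returns a live list, so copy it before mutating the tree.
    $nodes = iterator_to_array($doc->getElementsByTagName('includes'));

    foreach ($nodes as $node) {
        $name = $node->getAttribute('module');
        if (!isset($modules[$name])) {
            continue; // unknown module: leave the tag in place
        }
        $fragment = $doc->createDocumentFragment();
        $fragment->appendXML($modules[$name]);
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the root element, so no XML declaration is emitted.
    return $doc->saveXML($doc->documentElement);
}

$page = "<html><body><includes module='header'/><p>Hi</p><includes module='footer'/></body></html>";
echo replaceIncludes($page, [
    'header' => '<header>Top</header>',
    'footer' => '<footer>Bottom</footer>',
]);
```

With the regex approach, each module would need its own pass over the whole string, whereas here every additional module only adds one more node replacement.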
If you want me to get more specific in any area (i.e. give a Tidy example, or work on a benchmark), let me know.
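In the meantime, this is not a Tidy example, just a small sketch of how the parse failure on an unclosed tag could be detected instead of taking the page down. The helper name loadXmlOrNull is hypothetical; the libxml functions are part of PHP's standard libxml support.

```php
<?php
// Hypothetical guard: returns null instead of emitting warnings
// when the markup is not well-formed XML.
function loadXmlOrNull(string $xml): ?DOMDocument
{
    // Collect libxml errors internally rather than raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    return $ok ? $doc : null;
}

// Self-closed tag: parses fine.
var_dump(loadXmlOrNull("<div><includes module='footer'/></div>") !== null);
// Missing slash: the tag is never closed, so loading fails.
var_dump(loadXmlOrNull("<div><includes module='footer'></div>") !== null);
```

When null comes back, the caller can fall back to the untouched markup or run a repair step first.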