While it's absolutely true that regexps are not the right tool to fully parse HTML documents, I see a lot of people blindly dismissing any question about regexps if they so much as see a single HTML tag in the proposed text.
Since we see a lot of examples of regexps not being the right tool, I ask your opinion on this: in which cases is a simple pattern match a better solution than a full parsing engine?
When you know what you're doing! ;)
I just found an example of a regexp beating an HTML parser. I needed to extract some information from a long page (8231 lines, 400 KB) and I first tried using simple_html_dom. Since I got stuck due to the problem reported in this question, I went for the alternative approach and realized that I actually only needed the information contained in the first 416 lines of that file (~5% of the total), so loading the whole DOM into memory looked like a huge waste of resources.
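To make the idea concrete, here is a minimal sketch in Python (not the PHP code I actually used). The function name `items_until_list_end`, the `<li>` pattern, and the sample markup are all made up for illustration, and it assumes each list item sits on a single line.

```python
import io
import re

# Hypothetical pattern: capture the text of an <li> that fits on one line.
ITEM_RE = re.compile(r"<li\b[^>]*>(.*?)</li>", re.IGNORECASE)

# Made-up sample standing in for the start of a much longer page.
SAMPLE = (
    "<html><body>\n"
    "<ul>\n"
    "<li>alpha</li>\n"
    '<li class="x">beta</li>\n'
    "</ul>\n"
    "<p>thousands of lines we never look at</p>\n"
    "<ul><li>gamma</li></ul>\n"
)


def items_until_list_end(lines, end_marker="</ul>"):
    """Collect <li> texts line by line, stopping at the first end_marker."""
    items = []
    for line in lines:
        items.extend(ITEM_RE.findall(line))
        if end_marker in line.lower():
            break  # the rest of the input is never processed
    return items


# Any iterable of lines works, including an open file object.
print(items_until_list_end(io.StringIO(SAMPLE)))  # ['alpha', 'beta']
```

This only holds up because the layout of that one page is known in advance; on arbitrary HTML the one-item-per-line assumption breaks immediately.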
Now I still don't know why simple_html_dom is failing on that page, so I can't really compare the performance of the two solutions, but the regexp version only loads as many lines as needed (up to the end of the <ul> I'm interested in and no more) and is very quick.
Obviously, in the simplest cases you might get along with a regex. But even then, a perfectly valid HTML tag can come in so many different varieties (quoting style, attribute order, letter case, whitespace) that the regex to catch them reliably gets HUGE. A DOM-based parser will parse it, give you a proper error message if it fails, and provide stable results.
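A small sketch of that point, in Python for illustration: one naive pattern against several equivalent spellings of the same link, next to the standard library's `html.parser`. The pattern, the `HrefCollector` class, and the sample tags are assumptions chosen for the demo, not taken from any particular page.

```python
import re
from html.parser import HTMLParser

# Naive pattern: lowercase tag, href as the only attribute, double quotes.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


def parse_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Equivalent spellings of the same link.
VARIANTS = [
    '<a href="/x">',
    "<a href='/x'>",
    "<a href=/x>",
    '<A HREF="/x">',
    '<a class="c" href="/x">',
    '<a  href = "/x" >',
]

for html in VARIANTS:
    print(repr(html), bool(NAIVE_HREF.search(html)), parse_hrefs(html))
```

Only the first variant matches the naive pattern, while the parser returns `/x` for all six. Each variant can be patched into the regex, but that is exactly how it ends up huge.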
Jeff Atwood discusses it extensively in his blog posts entitled "Programming Is Hard, Let's Go Shopping" and "Parsing HTML The Cthulhu Way".
Find more details in the posts mentioned above.
If you can guarantee that the pattern you need to match is within a single HTML tag, then maybe you could create a regular expression to match it.
In other words, not when you need an expression to find matching start and end tags, and not when the content you need to match might contain nested tags, comments, CDATA sections, etc.
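A quick illustration of both halves, sketched in Python. The `<meta>` tag with a fixed attribute order and the sample markup are assumptions made up for the example.

```python
import re

# Fine: the value lives entirely inside one tag whose shape we assume is fixed.
META_RE = re.compile(r'<meta\s+name="generator"\s+content="([^"]*)"', re.IGNORECASE)

# Not fine: pairing a start tag with its end tag when tags can nest.
DIV_RE = re.compile(r"<div>(.*?)</div>", re.DOTALL)

PAGE = (
    '<meta name="generator" content="ExampleCMS 1.0">\n'
    "<div>outer <div>inner</div> tail</div>"
)

print(META_RE.search(PAGE).group(1))  # ExampleCMS 1.0
print(DIV_RE.search(PAGE).group(1))   # outer <div>inner
```

The second result is cut off at the first `</div>`, because the pattern has no way of knowing which end tag belongs to which start tag. Making the quantifier greedy does not fix it; it just over-matches as soon as there are sibling elements.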
You can use regexps either when you parse HTML you have control over or when you are writing a parser for one specific HTML page. You should not use regexps when trying to build a universal parser.