I need to match all of these opening tags:
<p>
<a href="foo">
But not these:
<br />
<hr class="foo" />
I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.
<([a-z]+) *[^/]*?>
I believe it says:
- Find a less-than, then
- Find (and capture) a-z one or more times, then
- Find zero or more spaces, then
- Find any character zero or more times, greedy, except /, then
- Find a greater-than
Do I have that right? And more importantly, what do you think?
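One quick way to check a pattern like this is to run it against the sample tags. The sketch below uses Python's re module, which is an assumption on my part since the question doesn't name a regex engine; the helper name captured_name is hypothetical.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"<([a-z]+) *[^/]*?>")

def captured_name(tag):
    """Return the captured tag name if the whole string matches, else None."""
    match = PATTERN.fullmatch(tag)
    return match.group(1) if match else None

if __name__ == "__main__":
    samples = ["<p>", '<a href="foo">', "<br />", '<hr class="foo" />',
               "<h1>", '<a href="/foo">']
    for tag in samples:
        print(tag, "->", captured_name(tag))
```

Two things show up when running it. The four tags from the question behave as intended, but `<h1>` is captured as just `h` (the capture group only allows letters), and `<a href="/foo">` is rejected because `[^/]` forbids a slash anywhere inside the tag, not only right before the `>`. Also note that `*?` is the lazy form of the quantifier, not the greedy one.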
Try:
It is similar to yours, but the last > must not be after a slash, and also accepts h1.
Output:
Basically just define the element node names that are self-closing, load the whole HTML string into a DOM library, grab all elements, loop through them, keep only the ones which aren't self-closing, and operate on those.
I'm sure you already know by now that you shouldn't use regex for this purpose.
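For illustration, here is a minimal sketch of that parser-based idea in Python. It uses the standard library's html.parser, which is an event-based parser rather than a full DOM library, so it is a stand-in for the approach described, not the original code. The VOID_ELEMENTS set and the names OpeningTagCollector and collect_opening_tags are my own assumptions.

```python
from html.parser import HTMLParser

# Hand-written set of element names that never take a closing tag (assumption).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

class OpeningTagCollector(HTMLParser):
    """Collects (name, attributes) for opening tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for <p>, <a href="foo">, and also for a bare <br>.
        if tag not in VOID_ELEMENTS:
            self.opening_tags.append((tag, dict(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style tags such as <br />; deliberately ignored.
        pass

def collect_opening_tags(html):
    collector = OpeningTagCollector()
    collector.feed(html)
    collector.close()
    return collector.opening_tags
```

Calling `collect_opening_tags('<p><a href="foo">x</a><br /><hr class="foo" /></p>')` keeps the `p` and `a` elements and drops the other two.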
Disclaimer: use a parser if you have the option. That said...
This is the regex I use (!) to match HTML tags:
It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.
I guess to make it not match self-contained tags, you'd either want to use Kobi's negative look-behind:
or just combine if and if not.
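A rough sketch of both options in Python, for comparison. The patterns here are simplified illustrations that I made up for this example, not the regex from this answer, and the function names are hypothetical. Neither handles a `>` inside an attribute value.

```python
import re

# Option 1: a negative look-behind rejects a tag whose final ">" follows a "/".
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>")

# Option 2: match any start tag, then filter in ordinary code ("if and if not").
ANY_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

def opening_tags_lookbehind(html):
    return [m.group(1) for m in OPEN_TAG.finditer(html)]

def opening_tags_two_step(html):
    return [
        m.group(1)
        for m in ANY_TAG.finditer(html)
        if not m.group(0).endswith("/>")
    ]
```

The two-step version is easier to read and debug, since the "is it self-closing" test is a plain string check instead of a look-around assertion.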
To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.
Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
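As a sketch of what such a clean-up pass could look like (in Python, with a pattern I wrote for this example rather than the one from the product mentioned above; NOISE and strip_noise are hypothetical names):

```python
import re

# Comments, CDATA sections, and whole <script>/<style> elements.
# DOTALL lets ".*?" span newlines; IGNORECASE covers <SCRIPT> and friends.
NOISE = re.compile(
    r"<!--.*?-->"
    r"|<!\[CDATA\[.*?\]\]>"
    r"|<(script|style)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)

def strip_noise(html):
    """Remove the constructs that tend to confuse a tag-matching regex."""
    return NOISE.sub("", html)
```

This is still a heuristic: a script that contains the text `</script>` inside a string literal, for instance, would end the match early.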
It seems to me you're trying to match tags without a "/" at the end. Try this:
Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:
In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.
I have composed a haiku describing the nature of HTML.
I have also composed a haiku describing the nature of regex in Perl.
I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):
It's called htmlsplit; it splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.
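To give a feel for the idea, here is a very small Python sketch of that kind of splitter. It is not the Perl htmlsplit script, just an illustration under the same "well-behaved HTML only" assumption; html_split is a hypothetical name.

```python
import re
import sys

# Either one complete tag, or a run of text up to the next "<".
TOKEN = re.compile(r"<[^>]*>|[^<]+")

def html_split(html):
    """Return a list with one tag or one chunk of text per entry."""
    pieces = (token.strip() for token in TOKEN.findall(html))
    return [piece for piece in pieces if piece]

if __name__ == "__main__":
    # Slurp everything first, then print one token per line.
    for line in html_split(sys.stdin.read()):
        print(line)
```

With one token per line, something like grep can then pick out the lines that are opening tags.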
It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.
I bet I will get downvoted for this.
HTML Split
Against my expectation this got some upvotes, so I'll suggest some better regular expressions:
They are good for XML / XHTML.
With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.
The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.
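Here is roughly what I mean by that style, sketched in Python: small named definitions that are composed into bigger ones, like the definitions section of a Lex file. These particular patterns are illustrative assumptions aimed at XML / XHTML-like input, not the regular expressions referred to above.

```python
import re

# Small named pieces first...
NAME = r"[A-Za-z_:][\w:.-]*"
DQ_VALUE = r'"[^"<]*"'
SQ_VALUE = r"'[^'<]*'"
ATTR_VALUE = rf"(?:{DQ_VALUE}|{SQ_VALUE})"
ATTRIBUTE = rf"\s+{NAME}\s*=\s*{ATTR_VALUE}"

# ...then the tag forms built out of them.
OPEN_TAG = re.compile(rf"<(?P<name>{NAME})(?:{ATTRIBUTE})*\s*>")
EMPTY_TAG = re.compile(rf"<(?P<name>{NAME})(?:{ATTRIBUTE})*\s*/>")
CLOSE_TAG = re.compile(rf"</(?P<name>{NAME})\s*>")

def classify(tag):
    """Return 'open', 'empty', 'close', or None for a single tag string."""
    for kind, pattern in (("open", OPEN_TAG), ("empty", EMPTY_TAG), ("close", CLOSE_TAG)):
        if pattern.fullmatch(tag):
            return kind
    return None
```

Each piece can be tested on its own, and the final patterns read almost like a grammar.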