RegEx match open tags except XHTML self-contained

Posted 2020-01-22 07:32

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right? And more importantly, what do you think?
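A quick way to check this reading is to run the pattern against the sample tags. The sketch below uses Python's `re` module; the helper name `captured_name` and the two extra samples (`<a href="/foo">` and `<h1>`) are assumptions added for illustration, not part of the question. Note that `*?` is the lazy (non-greedy) form of `*`.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'<([a-z]+) *[^/]*?>')

def captured_name(tag):
    """Return the captured tag name if the whole string matches, else None."""
    m = PATTERN.fullmatch(tag)
    return m.group(1) if m else None

# The last two samples are illustrative additions.
samples = ['<p>', '<a href="foo">', '<br />', '<hr class="foo" />',
           '<a href="/foo">', '<h1>']
for tag in samples:
    print(tag, '->', captured_name(tag))
```

Under these assumptions the two self-closing samples are rejected, but so is an opening tag with a slash inside an attribute value, and `<h1>` captures only `h` because the group allows letters only.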

Tags: html regex xhtml
30 answers
老娘就宠你
#2 · 2020-01-22 08:07

Try:

<([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the closing > must not be preceded by a slash, and it also accepts tag names such as h1.
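A minimal sketch of how this pattern behaves, assuming Python's `re` module (which supports the fixed-width look-behind used here). The helper `tag_name` and the sample strings are hypothetical, chosen only to exercise the pattern.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r'<([^\s]+)(\s[^>]*?)?(?<!/)>')

def tag_name(tag):
    """Return group 1 if the whole string matches the pattern, else None."""
    m = PATTERN.fullmatch(tag)
    return m.group(1) if m else None

for tag in ['<p>', '<h1>', '<a href="foo">', '<br />',
            '<hr class="foo" />', '</p>']:
    print(tag, '->', tag_name(tag))
```

One thing this shows: because `[^\s]+` accepts a leading slash, a closing tag such as `</p>` also matches, with `/p` as the captured name.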

走好不送
#3 · 2020-01-22 08:07
<?php
// Element names that are treated as self-closing.
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

$html = '
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>';

// Parse the markup into a DOM tree and fetch every element.
$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    // Keep only the elements that are not self-closing.
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically, define the element node names that are self-closing, load the whole HTML string into a DOM library, grab all elements, then loop through them, keep the ones that aren't self-closing, and operate on those.

I'm sure you already know by now that you shouldn't use regex for this purpose.
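The same approach can be sketched in Python with the standard library's `html.parser`. This is a rough equivalent rather than a port of the PHP above: the class name `OpenTagCollector` and the function `open_tag_names` are made up for the example, and unlike `DOMDocument` this parser does not add `html` and `body` elements.

```python
from html.parser import HTMLParser

# Same list of self-closing element names as in the PHP snippet.
VOID = {'area', 'base', 'basefont', 'br', 'col', 'frame', 'hr', 'img',
        'input', 'isindex', 'link', 'meta', 'param', 'embed'}

class OpenTagCollector(HTMLParser):
    """Collects the names of start tags that are not in VOID."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <hr/>, because the default
        # handle_startendtag() delegates to this method.
        if tag not in VOID:
            self.names.append(tag)

def open_tag_names(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.names

html = '''
<p><a href="#">foo</a></p>
<hr/>
<br/>
<div>name</div>'''
print(open_tag_names(html))  # ['p', 'a', 'div']
```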

Animai°情兽
#4 · 2020-01-22 08:08

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:

<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.
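The "if and if not" route can be sketched in Python. The standard `re` module rejects variable-width look-behinds such as `(?<!/\s*)`, so this example matches with the first regex and then filters with a second one. The names `TAG`, `SELF_CLOSED` and `open_tags` are assumptions, and skipping closing tags like `</a>` is an extra step not mentioned above.

```python
import re

# The tag regex from this answer, unchanged.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

# "If not": a tag that ends in "/>" (optionally with spaces before ">").
SELF_CLOSED = re.compile(r'/\s*>$')

def open_tags(html):
    """Return matched tags that are neither closing nor self-closed."""
    return [tag for tag in TAG.findall(html)
            if not tag.startswith('</')        # added assumption: skip closers
            and not SELF_CLOSED.search(tag)]

html = '<p><a href="#">foo</a></p><hr/><br /><a name="badgenerator"">'
print(open_tags(html))
```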

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...

你好瞎i
#5 · 2020-01-22 08:09

It seems to me you're trying to match tags without a "/" at the end. Try this:

<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>
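A small check of this pattern, assuming Python's `re` module; the sample markup is invented for the example.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>')

html = '<p><a href="/foo">text</a></p><br /><hr class="foo" /><h1>Title</h1>'

# With a single capture group, findall() returns the captured tag names.
names = PATTERN.findall(html)
print(names)  # ['p', 'a', 'h1']
```

Closing tags are skipped because the name must start with a letter. A `>` inside a quoted attribute value would still end the match early, since `[^>]*` knows nothing about quotes.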
疯言疯语
#6 · 2020-01-22 08:12

Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:

It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.

I have composed a haiku describing the nature of HTML.

HTML has
complexity exceeding
regular language.

I have also composed a haiku describing the nature of regex in Perl.

The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>
淡お忘
#7 · 2020-01-22 08:12

I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):

$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print

It's called htmlsplit; it splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.
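For readers who don't speak Perl, here is a step-by-step Python sketch of the same substitutions. It is a transliteration under my reading of the one-liner, not the original script; the function name `htmlsplit` is borrowed from the text and the sample input is made up.

```python
import re

def htmlsplit(html):
    """Put each tag or chunk of text on its own line."""
    text = re.sub(r'[\n\r \t]+', ' ', html)         # tr/\n\r \t/ /s
    text = text.replace('<', '\n<')                 # s/</\n</g
    text = text.replace('>', '>\n')                 # s/>/>\n/g
    text = re.sub(r'\n ?\n', '\n', text)            # s/\n ?\n/\n/g
    text = re.sub(r'^ ?\n', '', text, count=1)      # s/^ ?\n//s
    text = re.sub(r' $', '', text, count=1)         # s/ $//s
    return text

sample = '<p>Hello <b>world</b></p>\n'
print(htmlsplit(sample), end='')
```

For the sample above, the result has six lines: `<p>`, `Hello ` (with its trailing space), `<b>`, `world`, `</b>` and `</p>`.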

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.

I bet I will get downvoted for this.

HTML Split


Against my expectation this got some upvotes, so I'll suggest some better regular expressions:

/(<.*?>|[^<]+)\s*/g    # get tags and text
/(\w+)="(.*?)"/g       # get attributes

They are good for XML / XHTML.
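A short sketch of the two expressions in use, translated to Python's `re` module (the `/g` flag becomes `findall`). The names `TOKEN` and `ATTR` and the sample markup are assumptions for the example.

```python
import re

TOKEN = re.compile(r'(<.*?>|[^<]+)\s*')   # tags and text
ATTR = re.compile(r'(\w+)="(.*?)"')       # attributes inside one tag

html = '<a href="foo" class="x">link</a> text'

tokens = TOKEN.findall(html)
print(tokens)   # ['<a href="foo" class="x">', 'link', '</a>', 'text']

attrs = ATTR.findall(tokens[0])
print(attrs)    # [('href', 'foo'), ('class', 'x')]
```

As written, the attribute pattern only sees double-quoted values, which fits well-formed XML / XHTML better than loose HTML.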

With minor variations, they can cope with messy HTML, or you can convert the HTML to XHTML first.


The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.
