How to write a regular expression for HTML parsing

Posted 2020-02-13 03:16

Question:

I'm trying to write a regular expression for my HTML parser.

I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire element, from <div> to </div>. I keep getting "memory exhausted" errors; my program probably treats every tag it finds as a match.

I'm using the Boost regex library.

Answer 1:

You may also find these questions helpful:

Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Can you provide an example of parsing HTML with your favorite parser?



Answer 2:

You should probably look at this question about regexps and HTML. The gist is that regular expressions are far from an ideal way to parse HTML.



Answer 3:

As others have said, don't use regexes if at all possible. If your markup is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.



Answer 4:

Maybe regexps aren't the best solution, but I'm already using something like five different libraries, and Boost does fine when it comes to locating <a href> tags and keywords.

I'm using these regexps:

/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags and:

/<a[^\n]*href[^\n]*>/searched keyword/</a>/ for locating links

(BTW can it be done better? - I suck at regex ;))

What I need now is locating tags containing <a href>'s and I think regexps will do all right - maybe I'll need to write my own parsing function as piotr said.



Answer 5:

Do as flex does: match <div> case-insensitively and put your parser in a "div matched" state, keep processing input until </div>, then reset the state.

This takes two regexps and a state variable.

The valid characters in an SGML tag name are [A-Za-z_:]

So: /<[A-Za-z_:]+>/ matches an opening tag that has no attributes.