When is it wise to use regular expressions with HT

Posted 2019-01-19 00:34

While it's absolutely true that regexps are not the right tool to fully parse HTML documents, I am seeing a lot of people blindly dismissing any question about regexps if they so much as see a single HTML tag in the proposed text.

Since we see a lot of examples of regexp not being the right tool, I ask your opinion on this: what are the cases where a simple pattern match is a better solution than using a full parsing engine?

10 answers
疯言疯语
#2 · 2019-01-19 01:08

I think the best answer here is: regular expressions are the right tool except for when they aren't.

I think if you can cleanly and effectively solve your problem using regex, then go for it. But I've seen far too many regex hacks written because the programmer or web designer was just plain lazy.

Regex is powerful and one of the best tools a programmer can learn, but you also need to learn when to use it and when to use something different.

姐就是有狂的资本
#3 · 2019-01-19 01:10

When the set of HTML you're looking to parse with a regexp is known to conform to some sort of pattern, e.g. when you know there's no commented-out HTML or other complex scenarios.

For example, I often preach that you shouldn't use regexps for HTML, but if I have a set of HTML that I'm familiar with, that is straightforward, and that I can easily check after manipulation, then I have no qualms about using a regexp for that.
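
As a rough illustration of "check easily post-manipulation", here is a minimal Python sketch. Everything in it is an assumption made for the example: the `price` class, the `rewrite_prices` and `tag_sequence` helpers, and the choice of the standard-library `html.parser` as the checker. The idea is to let the regexp do the edit on markup whose shape is known, then confirm that the sequence of tags is the same before and after.

```python
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the order of start and end tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))


def tag_sequence(html):
    """Return the list of (kind, tag) pairs seen while parsing html."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def rewrite_prices(html):
    """Rewrite $N to 'USD N' inside <span class="price">, then sanity-check."""
    new_html, count = re.subn(
        r'(<span class="price">)\$(\d+)(</span>)',  # assumed, known markup shape
        r"\1USD \2\3",
        html,
    )
    # Post-manipulation check: the edit must not alter the tag structure.
    if tag_sequence(new_html) != tag_sequence(html):
        raise ValueError("regex edit changed the tag structure")
    return new_html, count


page = '<p><span class="price">$5</span> and <span class="price">$12</span></p>'
updated, replaced = rewrite_prices(page)
print(replaced, updated)
```

The check is deliberately cheap. It only works because the input is familiar and simple, which is exactly the situation described above.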

疯言疯语
#4 · 2019-01-19 01:11

One thing worth keeping in mind is that there are two main sources of objection to processing HTML with regular expressions. The first concerns the likelihood of junk HTML that is unpredictably malformed. This is in itself a legitimate reason to be skeptical of processing HTML with regex, and it rules out a lot of use cases from the start. The problem is that this objection is often used to "throw out the baby with the bathwater", and it is often conflated with the second main objection (with both usually left unstated) even though the two are completely unrelated.

The other main source of objection has to do with the complexity of the HTML language exceeding some idealized, theoretical conception of "regular expression" that is too general to apply to many use cases, yet is usually applied across the board. The objection goes something like this:

  1. Truism: Regular expressions process regular grammars.
  2. Truism: HTML is not a regular grammar.
  3. HTML cannot be processed with regular expressions.

I think a lot of people really just take these truisms at face value without considering what's meant by them. Bill Karwin, in another answer here, mentioned some cases where HTML is not a regular grammar, but this argument falls apart when the context is a "regex" engine that has non-regular features (like back references, or even recursion). These features solve many of the "not a regular grammar" objections, but may still fail on malformed documents.
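
To make the back-reference point concrete, here is a minimal Python sketch (the `PAIRED` pattern and `paired_elements` helper are hypothetical names invented for this example). The `\1` in the pattern requires the closing tag to repeat whatever name the opening tag captured, which a strictly regular grammar cannot express. It only handles attribute-free tags, and it does not handle nesting of the same tag name. That would need recursion, which Python's standard `re` module does not provide.

```python
import re

# \1 refers back to the tag name captured by group 1, so the closing tag
# must match the opening tag. Assumes simple tags with no attributes.
PAIRED = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def paired_elements(html):
    """Return (tag, inner_text) for each element whose close tag matches."""
    return [(m.group(1), m.group(2)) for m in PAIRED.finditer(html)]


sample = "<b>bold</b> <i>mismatched</b> <em>ok</em>"
print(paired_elements(sample))
# → [('b', 'bold'), ('em', 'ok')]

# Same-name nesting is where the back reference alone falls short:
print(paired_elements("<div><div>x</div></div>"))
# → [('div', '<div>x')]
```

The mismatched `<i>...</b>` is skipped rather than mis-paired, but the nested `div` case pairs the outer opening tag with the inner closing tag, so the approach still depends on knowing what the input looks like.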

This distinction is rarely drawn, and it's rarely pointed out that most modern "regular" expression libraries have capabilities far beyond regular language processing. I think these are important things to consider whenever evaluating "regular" expressions as the appropriate tool to process some HTML.

Rolldiameter
#5 · 2019-01-19 01:14

If the information that you are using has a regular grammar, then regexes are great. HTML doesn't have a regular grammar, so things are more complex.

Regexes are suitable if you know with absolute certainty what sort of thing you are looking for. Replacing:

<tag>Info</tag>

with

<tag>Dave</tag>

in a document that you have complete control of would make sense, but real-life HTML isn't like this.
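
A minimal Python sketch of that replacement, assuming the document really does contain only the exact literal form `<tag>Info</tag>` (the `replace_tag_text` helper is a hypothetical name for this example):

```python
import re


def replace_tag_text(html, old, new):
    """Replace <tag>old</tag> with <tag>new</tag>, exact literal form only."""
    pattern = re.compile(r"<tag>" + re.escape(old) + r"</tag>")
    # A function replacement avoids backslash processing in `new`.
    return pattern.sub(lambda m: "<tag>" + new + "</tag>", html)


doc = "<root><tag>Info</tag><tag>Other</tag></root>"
result = replace_tag_text(doc, "Info", "Dave")
print(result)
# → <root><tag>Dave</tag><tag>Other</tag></root>

# Anything outside the expected shape is silently left alone:
print(replace_tag_text('<tag class="x">Info</tag>', "Info", "Dave"))
# → <tag class="x">Info</tag>
```

The second call shows why control over the document matters: an attribute, extra whitespace, or different casing is enough for the pattern to stop matching without any warning.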
