Why use dom to parse webpages instead of regex?

2019-05-11 21:15发布

I've been searching for questions about finding contents in a page, and alot of answers recommend using DOM when parsing webpages instead of REGEX. Why is it so? Does it improve the processing time or something.

3条回答
We Are One
2楼-- · 2019-05-11 21:58

To my mind, it's safier to use REGEXP on pages where you don't have control on the content: HTML might be not formed properly, then DOM parser can fail.

Edit:
Well, considered what I just read, you should probably use regexp only if you need very small things, like getting all links of a document,e tc.

查看更多
倾城 Initia
3楼-- · 2019-05-11 22:14

A DOM parser is actually parsing the page.

A regular expression is searching for text, not understanding the HTML's semantic meaning.

It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.

You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.

Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.

So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).

查看更多
4楼-- · 2019-05-11 22:19

I can't hear that "HTML is not a regular language ..." anymore. Regular expressions (as used in todays languages) also aren't regular.

The simple answer is:

A regular expression is not a parser, it describes a pattern and it will match that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course regexes can be part of a parser, I don't know, but I assume nearly every parser will use regexes internally to find certain sub patterns.

If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you would not be able to create this pattern, because its practically not possible to cover all the corner cases, or dependencies like find all links but only if they are green and not pink.

In most cases its a lot easier to use a Parser, that understands the structure of your document, that accepts also a lot of "broken" HTML. It makes it so easy for you to access all links, or all table elements of a certain table, or ...

查看更多
登录 后发表回答