Why use DOM to parse webpages instead of regex?

I've been searching for questions about finding content in a page, and a lot of answers recommend using a DOM parser instead of regex when parsing webpages. Why is that? Does it improve processing time, or is there another reason?

Tags: php regex parsing dom search

3 answers

We Are One

#2 · 2019-05-11 21:58

To my mind, it's safer to use regex on pages whose content you don't control: the HTML might not be well formed, and a DOM parser can then fail.

Edit:
Well, considering what I just read, you should probably use regex only for very small tasks, like getting all the links in a document, etc.


倾城　Initia

#3 · 2019-05-11 22:14

A DOM parser is actually parsing the page.

A regular expression is searching for text, not understanding the HTML's semantic meaning.

It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.

You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
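As a rough illustration of that failure mode, here is a minimal sketch in Python's standard library (the question is tagged php, but the idea carries over). The sample markup, the `NAIVE_LINK` pattern, and the `LinkCollector` class are all made up for this example. The regex assumes `href` comes first and is double-quoted, so it picks up a commented-out link and misses a single-quoted one, while the parser reports only the tags that are really there.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one normal link, one commented-out link,
# and one link with a different attribute order and quoting style.
HTML = """
<p><a href="/real">real link</a></p>
<!-- <a href="/commented-out">old link</a> -->
<p><a class='nav' href='/quoted'>single quotes</a></p>
"""

# Naive pattern: assumes href is the first attribute and is double-quoted.
NAIVE_LINK = re.compile(r'<a href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Comments are routed to handle_comment, so they never reach here.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_by_regex(html):
    return NAIVE_LINK.findall(html)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(links_by_regex(HTML))   # ['/real', '/commented-out']
print(links_by_parser(HTML))  # ['/real', '/quoted']
```

The regex could of course be patched for each of these cases, but every patch only covers the variations you already thought of.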

Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
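To make the readability point concrete, here is a small sketch, again in Python rather than PHP. It assumes a well-formed, XHTML-like snippet, because `xml.etree.ElementTree` only parses XML and supports only a subset of XPath; a real page would need an actual HTML parser. The `PAGE` markup and the function names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with two tables.
PAGE = """<html><body>
  <table id="prices">
    <tr><td class="name">Apple</td><td class="price">1.20</td></tr>
    <tr><td class="name">Pear</td><td class="price">0.90</td></tr>
  </table>
  <table id="other">
    <tr><td class="price">99.00</td></tr>
  </table>
</body></html>"""


def prices_by_xpath(page):
    """Price cells of the table with id="prices" only."""
    root = ET.fromstring(page)
    cells = root.findall(".//table[@id='prices']/tr/td[@class='price']")
    return [cell.text for cell in cells]


def prices_by_regex(page):
    """Every price cell in the text; the pattern has no notion of 'which table'."""
    return re.findall(r'<td class="price">([^<]*)</td>', page)


print(prices_by_xpath(PAGE))  # ['1.20', '0.90']
print(prices_by_regex(PAGE))  # ['1.20', '0.90', '99.00']
```

The path expression states the structural condition (which table, which cell) directly, whereas the regex would need considerably more machinery to restrict itself to one table.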

So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).


傲

#4 · 2019-05-11 22:19

I can't stand hearing "HTML is not a regular language ..." anymore. Regular expressions (as implemented in today's languages) aren't regular either.

The simple answer is:

A regular expression is not a parser. It describes a pattern and matches that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course, regexes can be part of a parser; I don't know for sure, but I assume nearly every parser uses regexes internally to find certain sub-patterns.

If you can build a pattern for the stuff you want to find inside the HTML, fine, use it. But very often you won't be able to create that pattern, because it's practically impossible to cover all the corner cases, or conditions like "find all links, but only if they are green and not pink".

In most cases it's a lot easier to use a parser that understands the structure of your document and that also accepts a lot of "broken" HTML. It makes it easy to access all links, or all table elements of a certain table, or ...
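A minimal sketch of both points, using Python's standard-library `html.parser` (an event-based parser rather than a full DOM, chosen only to keep the example dependency-free). The sloppy markup below is made up, and "green" is assumed to be expressed as a CSS class name, which is just one way a page might mark such links. `GreenLinkCollector` and `green_links` are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical sloppy markup: unclosed <li> and <a> tags, an unquoted
# attribute value, and upper-case tag and attribute names.
BROKEN = """
<ul>
  <li><a class="green" href=a.html>first
  <li><a class="pink" href="b.html">second</a>
  <li><A HREF="c.html" CLASS="green big">third</A>
</ul>
"""


class GreenLinkCollector(HTMLParser):
    """Collects hrefs of <a> tags whose class list contains 'green'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "green" in classes and "href" in attributes:
            self.links.append(attributes["href"])


def green_links(html):
    collector = GreenLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(green_links(BROKEN))  # ['a.html', 'c.html']
```

The condition lives in ordinary code that works on parsed attributes, so changing it (say, "green but not big") is a one-line edit rather than a rewrite of a pattern.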
