When is it wise to use regular expressions with HTML?

While it's absolutely true that regexps are not the right tool to fully parse HTML documents, I see a lot of people blindly dismissing any question about regexps as soon as they spot a single HTML tag in the proposed text.

Since we see a lot of examples of regexp not being the right tool, I ask your opinion on this: what are the cases where a simple pattern match is a better solution than using a full parsing engine?

Tags: html regex parsing

10 Answers

叼着烟拽天下

#2 · 2019-01-19 00:47

When you know what you're doing!

; )


女痞

#3 · 2019-01-19 00:50

I just found an example of a regexp beating an HTML parser. I needed to extract some information from a long page (8231 lines, 400 KB) and first tried simple_html_dom. Since I got stuck due to the problem reported in this question, I went for the alternative approach and realized that I only needed the information contained in the first 416 lines of that file (~5% of the total), so loading the whole DOM into memory looked like a huge waste of resources.

Now I still don't know why simple_html_dom is failing on that, so I can't really compare the performance of the two solutions, but the regexp version only loads as many lines as needed (up to the end of the <ul> I'm interested in and no more) and is very quick.
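
For illustration, here is a rough Python sketch of that "read only as far as the closing </ul>" idea. It is not the code from this answer (which was presumably PHP); the function name, the patterns, and the sample markup are all made up, and it assumes each list item sits on a single line with no nested tags.

```python
import io
import re

# Hypothetical patterns: assume every item is written as <li ...>text</li>
# on one line, with no nested tags inside the item.
ITEM_RE = re.compile(r"<li[^>]*>([^<]*)</li>", re.IGNORECASE)
END_RE = re.compile(r"</ul\s*>", re.IGNORECASE)

def items_until_list_end(lines):
    """Collect <li> texts line by line, stopping at the first </ul>."""
    items = []
    for line in lines:
        items.extend(text.strip() for text in ITEM_RE.findall(line))
        if END_RE.search(line):
            break  # lines after the list are never read
    return items

# Stand-in for the large page; in practice `lines` could be an open file.
sample = io.StringIO(
    "<html><body>\n"
    "<ul>\n"
    "<li>first</li>\n"
    "<li class=\"x\"> second </li>\n"
    "</ul>\n"
    "<li>ignored, comes after the list</li>\n"
)
result = items_until_list_end(sample)
print(result)
```

Because the loop breaks at the closing tag, the cost depends on where the list ends, not on the size of the page. The price is that the patterns are tied to this one page layout.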


傲

#4 · 2019-01-19 00:58

Obviously, in the most simple cases like

<a>Test</a>

you might get along with a regex. But even then, a perfectly valid HTML tag could come in so many different varieties:

< A > Test</a>                // match
< a href="test">   Test</a>   // match
< A TEST="test"/>             // no match
< a href="test<">Test</A>     // invalid input - catch that with a regex!

that the regex to catch them reliably gets HUGE. A DOM-based parser will parse it, give you a proper error message if it fails, and provide stable results.
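
To make the point concrete, here is a small Python sketch that runs two patterns over the variants listed above. Both patterns (NAIVE and LOOSE) are my own illustrations, and the last sample, with a `>` inside a quoted attribute, is an extra case that is not in the list above.

```python
import re

# Illustrative patterns, not taken from the answer above.
NAIVE = re.compile(r"<a>(.*?)</a>")
LOOSE = re.compile(r"<\s*a\b[^>]*>(.*?)<\s*/\s*a\s*>", re.IGNORECASE | re.DOTALL)

samples = [
    '<a>Test</a>',
    '< A > Test</a>',
    '< a href="test">   Test</a>',
    '< A TEST="test"/>',
    '< a href="test<">Test</A>',
    '<a title="a>b">Test</a>',  # extra case: '>' inside a quoted attribute
]

def captured(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

naive_results = [captured(NAIVE, s) for s in samples]
loose_results = [captured(LOOSE, s) for s in samples]

for s, n, l in zip(samples, naive_results, loose_results):
    print(f"{s!r:32} naive={n!r:8} loose={l!r}")
```

The naive pattern only handles the very first sample. The loosened one copes with case and whitespace, but it silently accepts the input marked invalid above, and on the last sample it captures `b">Test` instead of `Test`, because `[^>]*` knows nothing about quoting. Each such fix makes the pattern longer, which is the "gets HUGE" problem.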


Anthone

#5 · 2019-01-19 01:02

Jeff Atwood discusses it extensively in his blog posts entitled "Programming Is Hard, Let's Go Shopping!" and "Parsing Html The Cthulhu Way".

"So, yes, generally speaking, it is a bad idea to use regular expressions when parsing HTML. We should be teaching neophyte developers that, absolutely. Even though it's an apparently neverending job. But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings. And how to tell which is the right approach for the task at hand."

Find more details in the posts mentioned above.


chillily

#6 · 2019-01-19 01:03

If you can guarantee that the pattern you need to match is within a single HTML tag, then maybe you could create a regular expression to match it.

In other words, not when you need an expression to find matching tag/endtags and not when the content you need to match might contain nested tags, comments, CDATA sections, etc.
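
A minimal Python sketch of both halves of that rule. The patterns and markup are invented for the example, and the first pattern assumes a double-quoted attribute value and no `>` inside any attribute.

```python
import re

# Case 1: everything we need lives inside one tag.
# Hypothetical pattern; assumes src is double-quoted.
IMG_SRC = re.compile(r'<img\b[^>]*\ssrc="([^"]*)"', re.IGNORECASE)

html = '<p>Logo: <img class="logo" src="/static/logo.png" alt="Logo"></p>'
match = IMG_SRC.search(html)
src = match.group(1) if match else None
print(src)

# Case 2: pairing a start tag with its end tag breaks as soon as tags nest.
DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)
nested = "<div>outer <div>inner</div> tail</div>"
inner = DIV.search(nested).group(1)
print(inner)  # stops at the first </div>, not the matching one
```

The first pattern stays inside a single tag and gets the value it was asked for. The second one has no way to count nesting depth, so it pairs the outer start tag with the inner end tag.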


你好瞎i

#7 · 2019-01-19 01:07

You can use regexps either when you parse HTML that you control or when you are writing a parser for one specific HTML page. You should not use regexps when trying to build a universal parser.

