Can you provide some examples of why it is hard to

2018-12-30 23:01发布

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr="5"
/>

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" />

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span> 

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>

What other gotchas are you aware of?

标签: html xml regex
12条回答
回忆,回不去的记忆
2楼-- · 2018-12-30 23:28

I think the problems boil down to:

  1. The regex is almost invariably incorrect. There are legitimate inputs which it will fail to match correctly. If you work hard enough you can make it 99% correct, or 99.999%, but making it 100% correct is almost impossible, if only because of the weird things that XML allows by using entities.

  2. If the regex is incorrect, even for 0.00001% of inputs, then you have a security problem, because someone can discover the one input that will break your application.

  3. If the regex is correct enough to cover 99.99% of cases then it is going to be thoroughly unreadable and unmaintainable.

  4. It's very likely that a regex will perform very badly on moderate-sized input files. My very first encounter with XML was to replace a Perl script that (incorrectly) parsed incoming XML documents with a proper XML parser, and we not only replaced 300 lines of unreadable code with 100 lines that anyone could understand, but we improved user response time from 10 seconds to about 0.1 seconds.

查看更多
时光乱了年华
3楼-- · 2018-12-30 23:30

Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.

I disagree. If you will use recursive in regex, you can easily find open and close tags.

Here I showed example of regex to avoid parsing errors of examples in first message.

查看更多
ら面具成の殇う
4楼-- · 2018-12-30 23:32

One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.

查看更多
还给你的自由
5楼-- · 2018-12-30 23:32

It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.

查看更多
还给你的自由
6楼-- · 2018-12-30 23:35

People normally default to writing greedy patterns, often enough leading to an un-thought-through .* slurping large chunks of file into the largest possible <foo>.*</foo>.

查看更多
旧时光的记忆
7楼-- · 2018-12-30 23:39

I believe this classic has the information you are looking for. You can find the point in one of the comments there:

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

Some more info from Wikipedia: Chomsky Hierarchy

查看更多
登录 后发表回答