One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:
People want to treat a file as a sequence of lines, but this is valid:
<tag
attr="5"
/>
People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:
<img src="imgtag.gif" alt="<img>" />
People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):
<span id="outer"><span id="inner">foo</span></span>
People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):
<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>
Comments may contain poorly formatted or incomplete tags:
<a href="foo">foo</a>
<!-- FIXME:
<a href="
-->
<a href="bar">bar</a>
What other gotchas are you aware of?
I think the problems boil down to:
The regex is almost invariably incorrect. There are legitimate inputs which it will fail to match correctly. If you work hard enough you can make it 99% correct, or 99.999%, but making it 100% correct is almost impossible, if only because of the weird things that XML allows by using entities.
If the regex is incorrect, even for 0.00001% of inputs, then you have a security problem, because someone can discover the one input that will break your application.
If the regex is correct enough to cover 99.99% of cases then it is going to be thoroughly unreadable and unmaintainable.
It's very likely that a regex will perform very badly on moderate-sized input files. My very first encounter with XML was to replace a Perl script that (incorrectly) parsed incoming XML documents with a proper XML parser, and we not only replaced 300 lines of unreadable code with 100 lines that anyone could understand, but we improved user response time from 10 seconds to about 0.1 seconds.
I disagree. If you will use recursive in regex, you can easily find open and close tags.
Here I showed example of regex to avoid parsing errors of examples in first message.
One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.
It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.
People normally default to writing greedy patterns, often enough leading to an un-thought-through .* slurping large chunks of file into the largest possible <foo>.*</foo>.
I believe this classic has the information you are looking for. You can find the point in one of the comments there:
Some more info from Wikipedia: Chomsky Hierarchy