I've got one more question today.
Are there any HTML parsers available that don't enforce strict syntax?
As far as I can see, such lenient analyzers are built into web browsers.
I mean, it would be very nice to have a parser that tolerantly processes the input document, allowing any of the following situations, which are invalid in XHTML and XML:
- Void tags that are not self-closed, for example:
<br>
or
<hr>
...
- Tag pairs with mismatched casing:
<td>...</TD>
- Attribute values without quotation marks:
<span class=hilite>...</SPAN>
- And so on.
Please suggest a suitable parser.
Thank you.
If you're happy with Python, Beautiful Soup is just such a parser.
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
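For what it's worth, here is a minimal sketch of how that might look with the `bs4` package (Beautiful Soup 4) and Python's built-in `html.parser` backend. The sample markup and the `summarize` helper are made up for illustration; they just combine the cases from the question (unclosed `<br>`/`<hr>`, mismatched casing, an unquoted attribute value).

```python
from bs4 import BeautifulSoup

# Hypothetical sample input combining the cases from the question:
# unclosed <br>/<hr>, mismatched tag casing, and an unquoted attribute value.
SAMPLE = (
    "<table><tr><td>cell</TD></tr></table>"
    "<br><hr>"
    "<span class=hilite>text</SPAN>"
)


def summarize(markup):
    """Parse sloppy HTML and return a few facts about the resulting tree.

    This helper is an illustrative assumption, not part of Beautiful Soup.
    """
    # "html.parser" selects the standard-library backend; other backends
    # (lxml, html5lib) may repair broken markup slightly differently.
    soup = BeautifulSoup(markup, "html.parser")
    return {
        # Text of the first <td>, which was closed by </TD>.
        "td_text": soup.td.get_text(),
        # Beautiful Soup reports "class" as a list of values.
        "span_class": soup.span.get("class"),
        "span_text": soup.span.get_text(),
        # Tag names in document order, as normalized by the parser.
        "tag_names": [tag.name for tag in soup.find_all(True)],
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, "=", value)
```

The parser lowercases tag names, so `<td>` and `</TD>` pair up, and the unquoted `class=hilite` is read as an ordinary attribute. Results can vary a little between backends, so treat this as a starting point rather than a guarantee.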
Hpricot is particularly good at parsing broken markup if you're not afraid of a bit of Ruby. http://github.com/whymirror/hpricot
TagSoup is available for various languages, including Java, C++ (Taggle) and XSLT (TSaxon).