How to parse non-strict HTML documents leniently

I've got one more question today.
Are there any HTML parsers available with a non-strict syntax analyzer?
As far as I can see, such analyzers are built into web browsers.
Ideally the parser would process the input document leniently, accepting any of the following constructs, all of which are invalid in XHTML and XML:

Void tags that are not self-closed, for example: <br> or <hr>
Tag pairs with mismatched case: <td>...</TD>
Attribute values without quotation marks: <span class=hilite>...</SPAN>
And so on.

Please suggest a suitable parser.
Thank you.

Tags: html parsing

3 answers

放荡不羁爱自由

Reply #2 · 2019-06-02 22:45

If you're happy with Python, Beautiful Soup is just such a parser.

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."

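Beautiful Soup is a third-party package, so here is a minimal sketch that uses only Python's standard-library html.parser (one of the parsers Beautiful Soup can run on top of) to show the kind of tolerance asked for. The class EventCollector, the helper parse_events and the sample markup are made up for illustration; this is not Beautiful Soup's own API.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records the events the lenient parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and accepts unquoted attribute values
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        # End tag names are lowercased too, so </TD> is reported as "td"
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


def parse_events(markup):
    """Feed markup to the collector and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Sample input containing the three problem cases from the question
SAMPLE = "<td>cell</TD><br><span class=hilite>note</SPAN>"
events = parse_events(SAMPLE)

for event in events:
    print(event)
```

None of the three constructs raises an error: the unclosed <br> is reported as a plain start tag, </TD> comes back as "td", and class=hilite is read as an ordinary attribute. With Beautiful Soup installed, BeautifulSoup(SAMPLE, "html.parser") builds a navigable tree from the same input.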

趁早两清

Reply #3 · 2019-06-02 22:45

Hpricot is particularly good at parsing broken markup if you're not afraid of a bit of Ruby. http://github.com/whymirror/hpricot


孤傲高冷的网名

Reply #4 · 2019-06-02 22:48

TagSoup is available for various languages, including Java, C++ (Taggle) and XSLT (TSaxon).

...TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
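
TagSoup itself is written in Java. To illustrate the general idea of turning messy HTML into well-formed XML, here is a rough Python sketch built on the standard-library html.parser. It is not TagSoup's algorithm and is far less thorough; XmlTidier, to_well_formed_xml, the VOID_TAGS set and the "root" wrapper element are all assumptions made for this example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Assumed (incomplete) set of tags that never have a closing tag
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class XmlTidier(HTMLParser):
    """Hypothetical helper: re-emits lenient HTML events as well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Quote every attribute value; valueless attributes reuse their name
        rendered = "".join(
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything opened since the matching start tag
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def to_well_formed_xml(markup, root="root"):
    """Return markup rewritten as XML, wrapped in a single root element."""
    tidier = XmlTidier()
    tidier.feed(markup)
    tidier.close()
    # Close whatever the input left open
    while tidier.open_tags:
        tidier.out.append(f"</{tidier.open_tags.pop()}>")
    return f"<{root}>{''.join(tidier.out)}</{root}>"


xml_text = to_well_formed_xml(
    "<td>a<br>b</TD><span class=hilite>c</SPAN><p>unclosed"
)
tree = ET.fromstring(xml_text)  # a strict XML parser now accepts the result
print(xml_text)
```

The final ET.fromstring call is the point of the exercise: once the input has been normalized, standard XML tools can be applied to it, which is the same benefit TagSoup's SAX interface is described as providing.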

