How to parse non-strict HTML documents leniently

Posted 2019-06-02 22:32

I've got one more question today.
Are there any HTML parsers available with non-strict syntax analyzers?
As far as I can see, such analyzers are built into web browsers.
I mean, it would be very nice to have a parser that processes the input document leniently, allowing any of the following situations that are invalid in XHTML and XML:

  • single tags that are not self-closed, for example <br> or <hr>
  • tag pairs with mismatched case: <td>...</TD>
  • attribute values without quotation marks: <span class=hilite>...</SPAN>
  • and so on

Please suggest a suitable parser.
Thank you.

Tags: html parsing
3 answers
放荡不羁爱自由
Reply #2 · 2019-06-02 22:45

If you're happy with Python, Beautiful Soup is just such a parser.

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
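
As a rough illustration, here is a minimal sketch that feeds Beautiful Soup the kinds of markup listed in the question. It assumes the `bs4` package is installed and uses Python's built-in `html.parser` as the backend; the sample string and variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample containing the problems from the question:
# an unclosed <br>, a mismatched-case </TD>, and an unquoted attribute value.
messy = "<table><tr><td>one<br>two</TD></tr></table><span class=hilite>hi</SPAN>"

# "html.parser" is the standard-library backend; lxml or html5lib could be
# used instead if they are installed.
soup = BeautifulSoup(messy, "html.parser")

# The cell is found even though its closing tag was written as </TD>.
cell_text = soup.find("td").get_text(" ")

# The unquoted attribute is available like any other attribute.
span = soup.find("span")
span_class = span["class"]
span_text = span.get_text()

print(cell_text)
print(span_class, span_text)
```

Different backends may repair badly broken markup in different ways, so it is worth checking the resulting tree when the input is worse than this sample.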

趁早两清
Reply #3 · 2019-06-02 22:45

Hpricot is particularly good at parsing broken markup if you're not afraid of a bit of Ruby. http://github.com/whymirror/hpricot

孤傲高冷的网名
Reply #4 · 2019-06-02 22:48

TagSoup is available for various languages, including Java, C++ (Taggle) and XSLT (TSaxon).

...TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
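
To give a feel for the event-driven style described above, here is a small Python sketch. It is not TagSoup and does not use SAX: it uses the standard library's `html.parser.HTMLParser`, which is similarly tolerant of sloppy input and reports start tags, end tags, and text as callbacks. Unlike TagSoup, it only tokenizes and does not balance or repair tags. The class name `EventCollector` and the sample input are made up for this example.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record tag and text events, loosely in the spirit of a SAX handler."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the event list short.
        if data.strip():
            self.events.append(("text", data.strip()))


collector = EventCollector()
# Hypothetical input: unclosed <br>, mismatched case, unquoted attribute value.
collector.feed("<td>a<br>b</TD><span class=hilite>c</SPAN>")
collector.close()

for event in collector.events:
    print(event)
```

Note that no end event is reported for `<br>` here, because the tokenizer does not add missing closing tags. Producing well-formed output, as TagSoup does, needs an extra repair step on top of an event stream like this.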
