Question:

Is it possible to parse an HTML document, either from a string or from a file, and construct a DOM tree that a developer can then walk through some API? If so, what tools could be used?

For example:

DomRoot = parse("myhtml.html");

for (tags : DomRoot) {
}

Note: this is an HTML document, not XHTML.

Answer 1:

You can use TagSoup, a SAX-compliant parser that can clean malformed content, such as HTML from generic web pages, into well-formed XML. For example:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
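Because TagSoup is SAX-compliant, the cleaned-up document reaches your code as a stream of SAX events. The sketch below shows only what consuming those events looks like. It uses the JDK's own SAX parser on the already well-formed output from above, wrapped in a `<p>` root element that is an assumption added for illustration. The class and method names are hypothetical. With TagSoup on the classpath, its parser could presumably be supplied as the `XMLReader` instead, since it implements that interface.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {

    // Records start and end element events in the order the parser reports them.
    static class TagLogger extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("<" + qName + ">");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("</" + qName + ">");
        }
    }

    // Parses well-formed markup with the JDK SAX parser and returns the logged events.
    // Wraps checked exceptions so callers do not need a throws clause.
    static List<String> events(String wellFormedXml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TagLogger logger = new TagLogger();
            reader.setContentHandler(logger);
            reader.parse(new InputSource(new StringReader(wellFormedXml)));
            return logger.events;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The <p> wrapper is an illustrative assumption; the inner markup is the rewritten output above.
        String xml = "<p>This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.</p>";
        System.out.println(events(xml));
    }
}
```

The events arrive properly nested (`<b>` opens, `<i>` opens and closes inside it, then a second `<i>` follows), which is exactly the balancing the rewrite above introduces.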

Answer 2:

JTidy should let you do what you want.

Usage is fairly straightforward, and parsing is configurable. For example:

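import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.tidy.Tidy;

InputStream in = ...;
Tidy tidy = new Tidy();
// configure Tidy instance as required
...
// parse the stream into a W3C DOM tree; null means no tidied output is written
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();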

The JTidy Javadoc is available online.
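Since `parseDOM` returns a standard `org.w3c.dom.Document`, walking the tree from the root element is ordinary W3C DOM traversal. Here is a minimal sketch of such a walk, using only the JDK. As an assumption for the sake of a runnable example, the `Document` comes from the JDK's XML parser and a well-formed string rather than from JTidy, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalkSketch {

    // Stand-in for tidy.parseDOM(in, null): builds a Document from markup that is
    // already well-formed, using the JDK's XML parser (an assumption for this sketch).
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    // Depth-first walk in document order that records each element's tag name.
    static void collectTagNames(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(((Element) node).getTagName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectTagNames(child, out);
        }
    }

    // Convenience wrapper: parse, then walk from the document element.
    static List<String> tagNames(String markup) {
        Element root = parseWellFormed(markup).getDocumentElement();
        List<String> names = new ArrayList<>();
        collectTagNames(root, names);
        return names;
    }

    public static void main(String[] args) {
        // prints [html, body, p, b]
        System.out.println(tagNames("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The same `collectTagNames` method should work unchanged on the `root` element obtained in the JTidy snippet, because it depends only on the `org.w3c.dom` interfaces.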

Answer 3:

You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or non-valid XML) file.

It is distributed under the Apache 2.0 license.