Is it possible to parse an HTML document, given as a string or read from a file, and construct a DOM tree that a developer can walk through some API? If so, what tools can be used?
For example:
DomRoot = parse("myhtml.html");
for (tags : DomRoot) {
}
Note: this is an HTML document, not XHTML.
You can use TagSoup. It is a SAX-compliant parser that cleans up malformed content, such as HTML from generic web pages, into well-formed XML.
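Since a SAX parser only emits events, you still need a step that collects them into a DOM tree. Here is a minimal sketch of one way to do that with the JDK's identity transformer. The class and method names (`SaxToDom`, `toDom`, `jdkReader`, `parse`) are made up for illustration, and the JDK's strict XML reader stands in for TagSoup so the example runs without extra jars. With TagSoup on the classpath you would pass an instance of its `XMLReader` implementation (`org.ccil.cowan.tagsoup.Parser`, as far as I know) to `toDom` instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: bridges any SAX XMLReader to a DOM Document.
final class SaxToDom {

    /** Runs the given SAX reader over the input and collects its events into a DOM tree. */
    static Document toDom(XMLReader reader, InputSource input) {
        try {
            DOMResult result = new DOMResult();
            // A transformer created without a stylesheet is an identity transform:
            // it copies the SAX event stream into the DOMResult unchanged.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new SAXSource(reader, input), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build a DOM from the SAX events", e);
        }
    }

    /** Stand-in reader (assumption): the JDK's strict XML parser, used here instead of TagSoup. */
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }

    /** Convenience wrapper for markup held in a string. */
    static Document parse(String markup) {
        return toDom(jdkReader(), new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints "html"
    }
}
```

The stand-in reader rejects malformed markup, so it only demonstrates the SAX-to-DOM plumbing. Tolerating real-world HTML is the part a cleaning parser such as TagSoup would supply.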
HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
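To illustrate the second half of that approach, here is a rough sketch that assumes the HTML has already been converted to well-formed XML by one of these tools. It uses only the JDK's `DocumentBuilder` to build the DOM and then walks the tree recursively, which is roughly the loop the question asks for. `DomWalk`, `parse` and `elementNames` are names I made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: builds a DOM from well-formed markup and walks it.
final class DomWalk {

    /** Parses well-formed XML (for example, cleaned-up HTML) into a DOM Document. */
    static Document parse(String wellFormedXml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    /** Returns the names of all elements under the node, in document (depth-first) order. */
    static List<String> elementNames(Node node) {
        List<String> names = new ArrayList<>();
        collect(node, names);
        return names;
    }

    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        // Visit children left to right; text and comment nodes are skipped by the check above.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, names);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><head><title>t</title></head><body><p>a</p><p>b</p></body></html>");
        System.out.println(elementNames(doc)); // prints [html, head, title, body, p, p]
    }
}
```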
JTidy should let you do what you want.
Usage is fairly straightforward, but parsing is configurable.
The JavaDoc is hosted here.
You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML file (or an invalid XML file).
It is distributed under the Apache 2.0 license.
There are several open source tools to parse HTML from Java.
Check http://java-source.net/open-source/html-parsers
You can also check the answers to this question: "Reading HTML file to DOM tree using Java". It is almost the same.