I'm trying to load a piece of (possibly) malformed HTML into an XmlDocument object, but it fails with an XmlException, since there are extra opening/closing tags and malformed XML tags such as <img > instead of <img />.
How do I get the XML to parse with all the errors in the data? Is there any XML validator that I can apply before parsing, to correct these errors? Or would handling the exception parse whatever can be parsed?
The HTML Agility Pack will parse HTML, rather than XHTML, and is quite forgiving. The object model will be familiar if you've used XmlDocument.
You might want to check out the answer to this question.
Basically, somewhere between a .NET port of Beautiful Soup and the HTML Agility Pack there is a way.
It's unlikely that you will be able to build an XmlDocument from markup that is this malformed. XmlDocument (to my knowledge) requires that XML content adhere to proper nesting and closure syntax.
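To make the nesting and closure requirement concrete, here is a minimal sketch. It uses Python's standard-library `xml.etree.ElementTree` as a stand-in for a strict XML parser, since it rejects the same kinds of markup; it is an analogy, not the XmlDocument API. The helper name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Sample inputs invented for this sketch.
loose = "<p>Logo: <img src='logo.png' ></p>"     # <img> is never closed
strict = "<p>Logo: <img src='logo.png' /></p>"   # <img /> is self-closed
extra_close = "<p>text</p></div>"                # stray closing tag

for sample in (loose, strict, extra_close):
    print(is_well_formed(sample), sample)
```

Only the self-closed variant is accepted; the unclosed `<img >` and the stray `</div>` both make the strict parser give up.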
However, I suspect that you could parse this with an XmlReader instead. It may still throw exceptions if certain egregious errors are encountered, but according to the MSDN docs, it can at least disclose the location of the errors.
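The idea of asking the parser where it gave up can be sketched as follows. This is again a Python standard-library analogy rather than the XmlReader API: `ParseError.position` holds the line and column reported by the underlying parser. The function name `find_first_error` and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def find_first_error(markup: str):
    """Return (line, column, message) for the first parse error, or None (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        line, column = exc.position  # location reported by the parser
        return line, column, str(exc)
    return None


# Sample document invented for this sketch: the <img> on line 3 is never closed,
# so the parser only notices the problem when it reaches </body>.
broken = "<html>\n<body>\n<img src='a.png' >\n</body>\n</html>"

error = find_first_error(broken)
if error is not None:
    line, column, message = error
    print(f"parse failed at line {line}, column {column}: {message}")
```

Note that the reported location is where the parser detected the problem, which can be later in the document than the tag that actually caused it.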
If you're just dealing with HTML, there is the HTML Agility Pack, which may serve your purposes.
Depending on the specific needs, you might be able to use HTML Tidy to clean up the document, then import it using the XmlDocument object.
What you are trying to do is very difficult. HTML cannot be parsed using an XML parser since XML is strict and HTML is not. If that HTML were compliant XHTML (HTML as XML), then an XML parser would parse the HTML without issue.
You might want to see if there are any HTML to XHTML converters out there, if you really want to use an XML parser for HTML.
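As a rough illustration of what such a converter does, the sketch below uses Python's forgiving `html.parser.HTMLParser` to re-emit loose HTML as well-formed XML, which a strict parser then accepts. It is a toy under stated assumptions, not a replacement for HTML Tidy or the HTML Agility Pack: the class `XhtmlRewriter`, the function `html_to_xml`, and the synthetic `root` wrapper are all invented here, comments and doctypes are dropped, and attribute names that are not legal XML names are not handled.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class XhtmlRewriter(HTMLParser):
    """Re-emits loosely written HTML as well-formed XML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # output fragments
        self.open_tags = []  # stack of elements still waiting for a close

    def handle_starttag(self, tag, attrs):
        seen = {}
        for name, value in attrs:
            # Keep the first occurrence; give valueless attributes a value.
            seen.setdefault(name, value if value is not None else name)
        rendered = "".join(f" {n}={quoteattr(v)}" for n, v in seen.items())
        if tag in VOID_TAGS:
            self.parts.append(f"<{tag}{rendered} />")  # <img > becomes <img />
        else:
            self.parts.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag (or a void element): ignore it
        while self.open_tags:
            current = self.open_tags.pop()
            self.parts.append(f"</{current}>")  # close anything left open inside
            if current == tag:
                break

    def handle_data(self, data):
        self.parts.append(escape(data))


def html_to_xml(html: str, root: str = "root") -> str:
    rewriter = XhtmlRewriter()
    rewriter.feed(html)
    rewriter.close()  # flush any buffered text
    while rewriter.open_tags:
        rewriter.parts.append(f"</{rewriter.open_tags.pop()}>")
    # A synthetic wrapper guarantees a single document element.
    return f"<{root}>{''.join(rewriter.parts)}</{root}>"


messy = "<div><p>Hi <b>there<img src=logo.png ></p></span></div>"
xml_text = html_to_xml(messy)
tree = ET.fromstring(xml_text)  # the strict parser now accepts it
print(xml_text)
```

The design choice is to let a tolerant tokenizer do the reading and to enforce well-formedness only when writing: unclosed elements are closed when an ancestor closes, stray closing tags are discarded, and void elements are always self-closed.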
In other words, I have yet to meet an XML parser that handles malformed XML... they are not designed to accept loose markup like HTML (for good reason, too :) )
You can't load malformed XML into an XmlDocument.
Check out the HTML Agility Pack on CodePlex.