Question:

I'm trying to load a piece of (possibly) malformed HTML into an XmlDocument object, but it fails with an XmlException, since there are extra opening/closing tags and malformed XML tags such as <img > instead of <img />.

How do I get the XML to parse with all the errors in the data? Is there any XML validator that I can apply before parsing, to correct these errors? Or would handling the exception parse whatever can be parsed?

Answer 1:

The HTML Agility Pack parses HTML rather than XHTML, and is quite forgiving. The object model will be familiar if you've used XmlDocument.

Answer 2:

You might want to check out the answer to this question.

Basically, somewhere between a .NET port of Beautiful Soup and the HTML Agility Pack there is a way.

Answer 3:

It's unlikely that you will be able to build an XmlDocument from content that is this malformed. XmlDocument (to my knowledge) requires that XML content adhere to proper nesting and closure syntax.

However, I suspect that you could parse this with an XmlReader instead. It may still throw exceptions if certain egregious errors are encountered, but according to the MSDN docs, it can at least disclose the location of the errors.

If you're just dealing with HTML, there is the HTML Agility Pack, which may serve your purposes.

Answer 4:

Depending on your specific needs, you might be able to use HTML Tidy to clean up the document, then import it using the XmlDocument object.

Answer 5:

What you are trying to do is very difficult. HTML cannot be parsed using an XML parser since XML is strict and HTML is not. If that HTML were compliant XHTML (HTML as XML), then an XML parser would parse the HTML without issue.
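To make the strict-versus-lenient difference concrete, here is a minimal sketch. It uses Python's standard library rather than .NET purely as a language-neutral illustration; `SAMPLE`, `try_strict_xml`, `TagCollector`, and `collect_tags` are hypothetical names made up for this example. The strict XML parser rejects the unclosed tags and reports where it gave up, while the HTML parser simply reports the tags it sees.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Malformed on purpose: <p> is never closed and <img > is not self-closed.
SAMPLE = '<div><p>Hello<img src="a.png" ></div>'


def try_strict_xml(markup):
    """Return None if the markup is well-formed XML, else (line, column) of the error."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        return err.position


class TagCollector(HTMLParser):
    """Records start tags without caring whether they are ever closed."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(markup):
    """Feed the markup to the lenient HTML parser and return the tags it saw."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    print("strict XML parser failed at:", try_strict_xml(SAMPLE))
    print("HTML parser saw tags:", collect_tags(SAMPLE))
```

The same split applies in .NET: an XML API expects well-formed input, while an HTML-oriented parser is built to tolerate this kind of markup.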

You might want to see if there are any HTML to XHTML converters out there, if you really want to use an XML parser for HTML.
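As a rough idea of what such a converter does, the sketch below re-serializes loose HTML as well-formed markup using only Python's standard library. It is an illustration under stated assumptions, not a replacement for a real tool such as HTML Tidy: `XhtmlBuilder` and `html_to_wellformed` are hypothetical names, comments and doctypes are dropped, attribute names are passed through unchecked, and the input is assumed to have a single root element.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content in HTML; emitted in self-closed form.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class XhtmlBuilder(HTMLParser):
    """Rebuilds the markup with every element explicitly closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # stack of elements still awaiting a close tag

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") get their own name as the value.
        rendered = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{rendered} />")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag with no matching open tag: drop it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def finish(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def html_to_wellformed(markup):
    builder = XhtmlBuilder()
    builder.feed(markup)
    return builder.finish()
```

The output of `html_to_wellformed` can then be handed to a strict XML parser. The repair strategy here (close whatever is still open, drop stray closing tags) is deliberately naive; browsers and dedicated tools apply much more detailed recovery rules.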

In other words, I have yet to meet an XML parser that handles malformed XML... they are not designed to accept loose markup like HTML (for good reason, too :) )