I've used HtmlAgilityPack in the past to parse HTML in .NET, but I don't like the fact that it only offers a DOM-based model.
On large documents, or documents with deep nesting, it is possible to hit stack overflow or out-of-memory exceptions. More generally, a DOM-based parser uses significantly more memory than a streaming one, because it holds the whole document in memory even though the code consuming the HTML may only need a few elements at a time.
Does anyone know of a decent HTML parser for .NET that lets you parse HTML in a manner similar to the XmlReader class, i.e. in a forward-only, streaming manner?
I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader
Like others have said, HTML doesn't follow the well-formedness rules of XML, so it is inherently difficult to parse, but SgmlReader usually does a pretty good job.
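A minimal sketch of what that looks like, assuming the SgmlReader package is referenced (it exposes an `Sgml.SgmlReader` class that derives from `XmlReader`). The `ElementNames` helper and the sample markup are made up for illustration; only the reader loop matters.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using Sgml; // from the SgmlReader package (assumed to be referenced)

static class SgmlDemo
{
    // Hypothetical helper: collects element names in document order.
    public static List<string> ElementNames(TextReader input)
    {
        var names = new List<string>();
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";                   // use the built-in HTML DTD
            reader.CaseFolding = CaseFolding.ToLower;  // normalise tag names to lower case
            reader.InputStream = input;

            // SgmlReader is an XmlReader, so this is the usual forward-only loop.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    names.Add(reader.Name);
            }
        }
        return names;
    }

    static void Main()
    {
        // Deliberately sloppy HTML: unclosed <p>, unquoted attribute, no </html>.
        const string html = "<html><body><p>first<p>second<a href=/x>link</a></body>";
        foreach (string name in ElementNames(new StringReader(html)))
            Console.WriteLine(name);
    }
}
```

Because it is just an `XmlReader`, you can also hand it to anything that accepts one, though for the streaming case you would normally stay in the `Read()` loop.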
The problem is that HTML can be malformed. And you can't know which tag is missing an end tag (or which tags are placed in the incorrect order) until you have parsed a larger part of the document.
If the documents you'll be parsing are well-formed, why don't you use XmlReader?
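For the well-formed case, a rough sketch of the forward-only loop might look like this. The `LinkExtractor.ReadLinks` helper and the sample markup are hypothetical; `XmlReader`, `XmlReaderSettings` and `DtdProcessing` are the standard `System.Xml` types.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class LinkExtractor
{
    // Hypothetical helper: yields each href as the reader passes over it,
    // so only the current node is held in memory.
    public static IEnumerable<string> ReadLinks(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            // XHTML usually starts with a DOCTYPE; the default (Prohibit) throws on it.
            DtdProcessing = DtdProcessing.Ignore
        };

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
                {
                    string href = reader.GetAttribute("href");
                    if (href != null)
                        yield return href;
                }
            }
        }
    }
}

static class Program
{
    static void Main()
    {
        const string xhtml =
            "<html><body><p><a href=\"/one\">1</a></p><a href=\"/two\">2</a></body></html>";

        foreach (string href in LinkExtractor.ReadLinks(new StringReader(xhtml)))
            Console.WriteLine(href);
    }
}
```

This only works if the input really is well-formed XML. An unclosed tag will make `Read()` throw an `XmlException`, and as far as I know so will a named HTML entity such as `&nbsp;` when the DTD is ignored.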