Anyone know of an HTML parser for VB.NET or C#? I know .NET has a lot of XML support, like XmlReader and XmlWriter. Is there an HtmlWriter or HtmlReader?
Ultimately, what I'd like is a library that will parse an HTML file and raise events based on the tags it finds. Anyone know of a library to do this?
The HTML Agility Pack is the way to go if you want to parse HTML (it even does a good job on tag soup). In theory, the XML parser included in the BCL should be able to parse valid XHTML, but the HTML Agility Pack is a general solution that handles ordinary HTML, XHTML, and messy variants of both.
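To illustrate what parsing with the HTML Agility Pack and surfacing each tag as an event could look like, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the TagEventParser class and its TagFound event are hypothetical names made up for this example and are not part of the library. Note that this approach builds the whole document tree first and only then raises the events while walking it.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Hypothetical helper for illustration; not part of the HTML Agility Pack.
public class TagEventParser
{
    // Raised once for every element node found in the parsed document.
    public event Action<HtmlNode> TagFound;

    public void Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed markup ("tag soup")

        // Walk every node under the document root and report the elements,
        // skipping text and comment nodes.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                TagFound?.Invoke(node);
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var parser = new TagEventParser();

        // Subscribe a handler that prints the name of each tag as it is reported.
        parser.TagFound += node => Console.WriteLine("Found <" + node.Name + ">");

        parser.Parse("<html><body><p>Hello <b>world</b></p></body></html>");
    }
}
```

Because the tree is already built when the handlers run, each handler can inspect node.Attributes or node.InnerText. The trade-off is that the whole document is held in memory, which may matter for very large files.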
Raising events when finding tags is something you'll have to implement yourself, of course, but it should be fairly trivial using the HtmlReader class.
I wrote this HtmlParser a long time ago and I just released it as an open source project on GitHub. It's faster than typical HTML parsing tools because it doesn't build the DOM. It does exactly what you asked for and raises "events" for each tag.
https://github.com/calbucci/CalbucciLib.HtmlParser
I just added it to NuGet:
https://www.nuget.org/packages/CalbucciLib.HtmlParser/