Is the Html Agility Pack still the best .NET HTML

When it comes to HTML parsing, there's no comparison to the real thing. This is a C# port of the validator.nu parser, the same code base used by Gecko-based browsers (e.g. Firefox). The repo looks a bit dusty, but don't be fooled: the port is outstanding. It's just been overlooked. I integrated it into CsQuery about a month ago. It passes all the CsQuery tests (which include most of the jQuery and Sizzle tests ported to C#).

I'm not aware of any other HTML5 parsers written in C#, or even any that come remotely close to doing a good job in terms of missing, optional, and invalid tag handling. This doesn't just do a great job though - it's standards compliant.

The repo I linked to above is the original port; it includes a basic wrapper that produces an XML node tree. CsQuery versions 1.3 and higher use this parser.

Answer 3:

There is also AngleSharp.

AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code. Also current features such as querySelector or querySelectorAll work for tree traversal.
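As a rough illustration of the querySelectorAll-style traversal mentioned above, here is a minimal sketch. It assumes the AngleSharp NuGet package at version 0.10 or later, where `HtmlParser.ParseDocument` is available; the class and method names `AngleSharpSketch` and `SelectText` are made up for this example and are not part of the library.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser; // from the "AngleSharp" NuGet package (0.10+ assumed)

public static class AngleSharpSketch
{
    // Hypothetical helper: parses an HTML string and returns the trimmed
    // text content of every element that matches the given CSS selector.
    public static string[] SelectText(string html, string selector)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // QuerySelectorAll mirrors the DOM method of the same name.
        return document.QuerySelectorAll(selector)
                       .Select(element => element.TextContent.Trim())
                       .ToArray();
    }
}
```

Calling `AngleSharpSketch.SelectText("<ul><li>one<li>two</ul>", "li")` should find both list items even though the `</li>` end tags are missing, since an HTML5 parser closes them implicitly.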

Answer 4:

Html Agility Pack was given as the answer to a StackOverflow question some time ago

The Html Agility Pack is still an outstanding solution for parsing HTML.

Is it still the best option?

Best? Well, that all depends on the task at hand, but generally I think so. There are occasions when it falls short of being ideal, but generally it will do a great job.
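For readers who have not used it, typical Html Agility Pack usage looks roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package; the class and method names `HapSketch` and `ExtractLinks` are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // from the "HtmlAgilityPack" NuGet package

public static class HapSketch
{
    // Hypothetical helper: collects the href value of every <a> element
    // that has one, using HAP's XPath support.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The null check is the main thing to remember: forgetting it is a common source of `NullReferenceException` when a page has no matching nodes.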

Is there something more lightweight?

You could try this: http://csharptest.net/browse/src/Library/Html/ It's nothing more than a handful of source files that pick apart HTML/XML via Regex. It supports a lightweight DOM and XPath but not much else. (help contents)

[Example]

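public void TestParse() {
    // Deliberately malformed input: unquoted attribute value, unclosed <html>.
    string notxml = "<html id=a ><body foo='bar' bar=\"foo\" />";
    var html = new HtmlLightDocument(notxml).Root;

    // The root element and its single attribute are still recovered.
    Assert.AreEqual("html", html.TagName);
    Assert.AreEqual(1, html.Attributes.Count);
    Assert.AreEqual("a", html.Attributes["id"]);
    // <body> is parsed as the only child of <html>.
    Assert.AreEqual(1, html.Children.Count);
}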

Alternatively you can use the parser directly instead of building a DOM tree. Just implement the IXmlLightReader interface, and call the static XmlLightParser.Parse method.

PS: It was written to settle an in-house debate by showing that Regex can parse HTML! Since then we have actually found many uses for it, since it is lightweight enough to embed anywhere. There are still ways to confuse the DOM hierarchy builder, but I haven't found any HTML the parser won't handle.

Answer 5:

I have used this before; it has a pretty easy-to-follow API. I think in the C#/.NET domain, this is a very good choice.

There is a Java library here. It looks pretty good, though I don't have personal experience with it.

Answer 6:

"Best" is a very relative term. For your question, I imagine you are searching for a reliable tool, so I think reliability should be taken into consideration. I would look at the support behind the tool and the strength of the company that provides it. It's a horrible feeling when you try to contact support for a tool you use and the answer is that the company no longer exists. As HAP is maintained by the developer community, I would rather trust it.

Answer 7:

If you are prepared to look outside the .NET world, the Python SO community recommends Beautiful Soup, for example html-parser-in-python.