Html Agility Pack was given as the answer to a StackOverflow question some time ago. Is it still the best option? What other options should be considered? Is there something more lightweight?
There is a spreadsheet with the comparisons.
In summary:
CsQuery Performance vs. Html Agility Pack and Fizzler
I put together some performance tests to compare CsQuery to the only practical alternative that I know of (Fizzler, an HtmlAgilityPack extension). I tested against three different documents:
- The sizzle test document (about 11 k)
- The wikipedia entry for "cheese" (about 170 k)
- The single-page HTML 5 spec (about 6 megabytes)
The overall results are:
- HAP is faster at loading the string of HTML into an object model. This makes sense, since I don't think Fizzler builds an index (or perhaps it builds only a relatively simple one). CsQuery takes anywhere from 1.1 to 2.6x longer to load the document. More on this below.
- CsQuery is faster for almost everything else. Sometimes by factors of 10,000 or more. The one exception is the "*" selector, where sometimes Fizzler is faster. For all tests, the results are completely enumerated; this case just results in every node in the tree being enumerated. So this doesn't test the selection engine so much as the data structure.
- CsQuery did a better job at returning the same results as a browser. Each of the selectors here was verified against the same document in Chrome using jQuery 1.7.2, and the numbers match those returned by CsQuery. This is probably because HtmlAgilityPack handles optional (missing) tags differently. Additionally, nth-child is not implemented completely in Fizzler - it only supports simple values (not formulae).
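To make the "simple values vs. formulae" distinction concrete: a simple value such as nth-child(3) picks one position, while a formula an+b picks every 1-based position that equals a*n + b for some integer n >= 0. The following is a minimal sketch of that matching rule only. The class and method names (NthChild, Matches, Select) are made up for illustration and are not taken from CsQuery, Fizzler, or any other library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only; not part of CsQuery or Fizzler.
public static class NthChild
{
    // True when the 1-based sibling position equals a*n + b for some integer n >= 0.
    public static bool Matches(int a, int b, int position)
    {
        if (a == 0)
        {
            // Simple value, e.g. nth-child(3): only position b matches.
            return position == b;
        }
        int diff = position - b;
        // n = diff / a must be a whole number and must not be negative.
        return diff % a == 0 && diff / a >= 0;
    }

    // Returns the 1-based positions among `count` siblings that the formula selects.
    public static List<int> Select(int a, int b, int count)
    {
        return Enumerable.Range(1, count).Where(p => Matches(a, b, p)).ToList();
    }
}
```

With six siblings, NthChild.Select(0, 3, 6) gives only position 3 (the simple-value case), NthChild.Select(2, 1, 6) gives 1, 3, 5 (the formula 2n+1, i.e. "odd"), and NthChild.Select(-1, 3, 6) gives 1, 2, 3 (the formula -n+3, i.e. the first three children). An engine that supports only simple values covers the first case but not the other two.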
When it comes to HTML parsing, there's no comparison to the real thing. This is a C# port of the validator.nu parser. This is the same code base used by Gecko-based browsers (e.g. Firefox). Their repo looks a bit dusty, but don't be fooled: the port is outstanding. It's just been overlooked. I integrated it into CsQuery about a month ago. It passes all the CsQuery tests (which include most of the jQuery and Sizzle tests ported to C#).
I'm not aware of any other HTML5 parsers written in C#, or even any that come remotely close to doing a good job in terms of missing, optional, and invalid tag handling. This doesn't just do a great job though - it's standards compliant.
The repo I linked to above is the original port; it includes a basic wrapper that produces an XML node tree. CsQuery versions 1.3 and higher use this parser.
There is also AngleSharp:
AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code. Also current features such as querySelector or querySelectorAll work for tree traversal.
Html Agility Pack was given as the answer to a StackOverflow question some time ago
The Html Agility Pack is still an outstanding solution for parsing HTML.
is it still the best option?
Best? Well, that all depends on the task at hand, but generally I think so. There are occasions when it does fall short of being ideal, but generally it will do a great job.
Is there something more lightweight?
You could try this: http://csharptest.net/browse/src/Library/Html/
It's nothing more than a handful of source files that pick apart HTML/XML via Regex. It supports a lightweight DOM and XPath but not much else. (help contents)
[Example]
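public void TestParse() {
    // Deliberately not well-formed XML: unquoted attribute value, unclosed <html>
    string notxml = "<html id=a ><body foo='bar' bar=\"foo\" />";
    var html = new HtmlLightDocument(notxml).Root;
    // The root element and its single attribute are recovered
    Assert.AreEqual("html", html.TagName);
    Assert.AreEqual(1, html.Attributes.Count);
    Assert.AreEqual("a", html.Attributes["id"]);
    // <body> is the only child of <html>
    Assert.AreEqual(1, html.Children.Count);
}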
Alternatively you can use the parser directly instead of building a DOM tree. Just implement the IXmlLightReader interface, and call the static XmlLightParser.Parse method.
PS: It was written to solve an in-house debate: that Regex can parse HTML! Since then we have actually found many uses for it, since it is lightweight enough to embed anywhere. There are still ways to confuse the DOM hierarchy builder, but I haven't found any HTML the parser won't handle.
I have used this before; it has a pretty easy-to-follow API. I think in the C#/.NET domain, this is a very good choice.
There is a Java library here. It looks pretty good, even though I don't have personal experience with it.
"Best" is a very relative term. For your question, I imagine you are searching for a reliable tool, so I think reliability should be taken into consideration.
I would look at the support and strength of the company that provides the tool.
It's a horrible feeling when you try to contact support for a tool you use and the answer is that the company no longer exists.
As HAP is maintained by the developer community, I would rather trust it.
If you are prepared to look outside the .NET world, the Python SO community recommends Beautiful Soup; see, for example, html-parser-in-python.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.