Html Agility Pack was given as the answer to a StackOverflow question some time ago. Is it still the best option? What other options should be considered? Is there something more lightweight?
The Html Agility Pack is still an outstanding solution for parsing HTML.
Best? Well, that all depends on the task at hand, but generally I think so. There are occasions when it falls short of ideal, but on the whole it does a great job.
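For readers who have not used it, here is a minimal sketch of what parsing with the Html Agility Pack typically looks like. It assumes the HtmlAgilityPack NuGet package is referenced; the sample markup and URLs are made up for illustration and are not from the original answer.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical, deliberately sloppy markup: the <p> and <li> tags are never closed.
        const string html = "<html><body><p>Links:<ul>"
            + "<li><a href=\"https://example.com/a\">First</a>"
            + "<li><a href=\"https://example.com/b\">Second</a>"
            + "</ul></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes takes an XPath expression and returns null
        // (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (var link in links)
        {
            // GetAttributeValue falls back to the supplied default if the attribute is missing.
            string href = link.GetAttributeValue("href", string.Empty);
            Console.WriteLine($"{link.InnerText.Trim()} -> {href}");
        }
    }
}
```

The null check matters in practice: forgetting that SelectNodes can return null is a common source of NullReferenceException when a page's structure changes.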
You could try this: http://csharptest.net/browse/src/Library/Html/ It's nothing more than a handful of source files that pick apart HTML/XML via Regex. It supports a lightweight DOM and XPath but not much else. (help contents)
[Example]
Alternatively, you can use the parser directly instead of building a DOM tree. Just implement the IXmlLightReader interface and call the static XmlLightParser.Parse method.
PS: It was written to settle an in-house debate: that Regex can parse HTML! Since then we have actually found many uses for it, since it is lightweight enough to embed anywhere. There are still ways to confuse the DOM hierarchy builder, but I haven't found any HTML the parser won't handle.
"Best" is a very relative term. For your question, I imagine you are looking for a reliable tool, so reliability should be taken into consideration. I would look at the support behind the tool and the strength of the company that provides it. It's a horrible feeling when you try to contact support for a tool you use and the answer is that the company no longer exists. As HAP is maintained by the developer community, I would rather trust it.
I have used this before; it has a pretty easy-to-follow API. In the C#/.NET domain, I think this is a very good choice.
There is a Java library here. It looks pretty good, even though I don't have personal experience with it.
If you are prepared to look outside the .NET world, the Python SO community recommends Beautiful Soup; see, for example, html-parser-in-python.
When it comes to HTML parsing, there's no comparison to the real thing. This is a C# port of the validator.nu parser. This is the same code base used by Gecko-based browsers (e.g. Firefox). The repo looks a bit dusty, but don't be fooled: the port is outstanding. It's just been overlooked. I integrated it into CsQuery about a month ago. It passes all the CsQuery tests (which include most of the jQuery and Sizzle tests ported to C#).
I'm not aware of any other HTML5 parsers written in C#, or even any that come remotely close to doing a good job in terms of missing, optional, and invalid tag handling. This doesn't just do a great job though - it's standards compliant.
The repo I linked to above is the original port; it includes a basic wrapper that produces an XML node tree. CsQuery versions 1.3 and higher use this parser.
There is also AngleSharp.
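As a rough sketch of how AngleSharp is used, the snippet below parses a fragment and selects links with a CSS selector rather than XPath. It assumes a reasonably recent AngleSharp NuGet package (0.10 or later, where the parser lives in AngleSharp.Html.Parser); the markup is invented for illustration.

```csharp
using System;
using AngleSharp.Html.Parser;

class Program
{
    static void Main()
    {
        // Hypothetical markup with unclosed <li> tags.
        const string html = "<ul>"
            + "<li><a href=\"/a\">First</a>"
            + "<li><a href=\"/b\">Second</a>"
            + "</ul>";

        // HtmlParser builds a document from a string synchronously.
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // QuerySelectorAll accepts CSS selectors, much like the browser DOM API.
        foreach (var link in document.QuerySelectorAll("a[href]"))
        {
            Console.WriteLine($"{link.TextContent.Trim()} -> {link.GetAttribute("href")}");
        }
    }
}
```

The main practical difference from the Html Agility Pack is the query style: AngleSharp exposes a browser-like DOM with CSS selectors, whereas HAP is usually queried with XPath.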