Is the Html Agility Pack still the best .NET HTML

Html Agility Pack was given as the answer to a StackOverflow question some time ago. Is it still the best option? What other options should be considered? Is there something more lightweight?

Tags: c# .net html parsing html-agility-pack

7 answers

你好瞎i

Reply #2 · 2020-01-30 03:30

Html Agility Pack was given as the answer to a StackOverflow question some time ago

The Html Agility Pack is still an outstanding solution for parsing HTML.

is it still the best option?

Best? Well, that all depends on the task at hand, but generally I think so. There are occasions when it falls short of ideal, but generally it does a great job.
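For readers who have not used it, here is a minimal sketch of typical Html Agility Pack usage: load a string and pull out link targets with XPath. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HapLinkExtractor, ExtractLinks) are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

// Hypothetical helper for illustration; not part of Html Agility Pack.
public static class HapLinkExtractor
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            // The second argument is the default used if the attribute is missing.
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }

        return links;
    }
}
```

The null check matters in practice: forgetting it is a common source of NullReferenceException when a page contains no matching nodes.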

Is there something more lightweight?

You could try this: http://csharptest.net/browse/src/Library/Html/ It's nothing more than a handful of source files that pick apart HTML/XML via Regex. It supports a lightweight DOM and XPath but not much else. (help contents)

[Example]

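public void TestParse() {
    // Deliberately sloppy markup: an unquoted attribute value, mixed quote
    // styles, and no closing </html> tag.
    string notxml = "<html id=a ><body foo='bar' bar=\"foo\" />";
    var html = new HtmlLightDocument(notxml).Root;

    // The root element is <html> with a single attribute, id="a".
    Assert.AreEqual("html", html.TagName);
    Assert.AreEqual(1, html.Attributes.Count);
    Assert.AreEqual("a", html.Attributes["id"]);
    // Its only child is the <body> element.
    Assert.AreEqual(1, html.Children.Count);
}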

Alternatively, you can use the parser directly instead of building a DOM tree: implement the IXmlLightReader interface and call the static XmlLightParser.Parse method.

PS: It was written to settle an in-house debate: that Regex can parse HTML! Since then we have actually found many uses for it, since it is lightweight enough to embed anywhere. There are still ways to confuse the DOM hierarchy builder, but I haven't found any HTML the parser won't handle.


ゆ、 Hurt°

Reply #3 · 2020-01-30 03:31

"Best" is a very relative term. For your question, I imagine you are looking for a reliable tool, so reliability should be taken into consideration. I would look at the support behind the tool and the strength of the company that provides it. It's a horrible feeling when you try to contact support for a tool you use and the answer is that the company no longer exists. As HAP is maintained by the developer community, I would rather trust it.


淡お忘

Reply #4 · 2020-01-30 03:35

I have used this before; it has a pretty easy-to-follow API. I think in the C#/.NET domain, this is a very good choice.

There is a Java library here. It looks pretty good, though I don't have personal experience with it.


Root（大扎）

Reply #5 · 2020-01-30 03:39

If you are prepared to look outside the .NET world, the Python SO community recommends Beautiful Soup; see, for example, html-parser-in-python.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.


兄弟一词,经得起流年.

Reply #6 · 2020-01-30 03:40

When it comes to HTML parsing, there's no comparison to the real thing. This is a C# port of the validator.nu parser. This is the same code base used by Gecko-based browsers (e.g. Firefox). The repo looks a bit dusty, but don't be fooled: the port is outstanding. It's just been overlooked. I integrated it into CsQuery about a month ago. It passes all the CsQuery tests (which include most of the jQuery and Sizzle tests ported to C#).

I'm not aware of any other HTML5 parsers written in C#, or even any that come remotely close to doing a good job in terms of missing, optional, and invalid tag handling. This doesn't just do a great job though - it's standards compliant.

The repo I linked to above is the original port; it includes a basic wrapper that produces an XML node tree. CsQuery versions 1.3 and higher use this parser.


萌系小妹纸

Reply #7 · 2020-01-30 03:48

There is also AngleSharp:

AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code. Also current features such as querySelector or querySelectorAll work for tree traversal.
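To give a feel for the querySelectorAll style mentioned above, here is a minimal sketch that extracts link targets with a CSS selector. It assumes the AngleSharp NuGet package, version 0.10 or later (where HtmlParser lives in AngleSharp.Html.Parser); the class and method names (AngleSharpLinkExtractor, ExtractLinks) are hypothetical and only for illustration.

```csharp
using System.Collections.Generic;
using AngleSharp.Html.Parser; // assumes the AngleSharp NuGet package, 0.10 or later

// Hypothetical helper for illustration; not part of AngleSharp.
public static class AngleSharpLinkExtractor
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        var links = new List<string>();

        // QuerySelectorAll takes a CSS selector and returns an empty
        // collection (never null) when nothing matches.
        foreach (var anchor in document.QuerySelectorAll("a[href]"))
        {
            links.Add(anchor.GetAttribute("href"));
        }

        return links;
    }
}
```

Compared with XPath-based libraries, the selector syntax here is the same one used in browsers, which can make it easier to reuse selectors worked out in the browser's developer tools.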

