Is there an XmlReader equivalent for HTML in .Net?

2019-05-03 20:21发布

I've used HtmlAgilityPack in the past to parse HTML in .Net but I don't like the fact that it only uses a DOM model.

On large documents and/or those with heavy levels of nesting it is possible to hit stack overflow or out of memory exceptions. Also in general a DOM based parsing model uses significantly more memory than a streaming based approach, typically because the process that wants to consume the HTML may only need a few elements to be available at a time.

Does anyone know of a decent HTML parser for .Net that allows you to parse HTML in a manner similar to the XmlReader class? i.e. in a forward only streaming manner

标签： .net html parsing html-agility-pack xmlreader

2条回答

该账号已被封号

2楼-- · 2019-05-03 20:33

I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader

Like others have said, there are issues in that HTML doesn't follow the same well-formed rules of XML, so it is inherently difficult to parse, but SgmlReader usually does a pretty good job.

0人赞添加讨论(0) 举报

混吃等死

3楼-- · 2019-05-03 20:50

The problem is that HTML can be malformed. And you can't know which tag is missing an end tag (or which tags are placed in the incorrect order) until you have parsed a larger part of the document.

If the documents that you'll parsed is well formed, why don't you use the XmlReader?

0人赞添加讨论(0) 举报

Is there an XmlReader equivalent for HTML in .Net?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间