What are fast XML parsers for Ruby? [closed]

Posted 2019-02-17 14:09

I am using Nokogiri, which works well for small documents. But for a 180KB HTML file I have to increase the process stack size via ulimit -s, and both parsing and XPath queries take a long time.

Are there faster methods available using a stock Ruby distribution?

I am getting used to XPath, but the solution does not necessarily need to support XPath.

The criteria are:

  1. Fast to write.
  2. Fast execution.
  3. Robust resulting parser.

Tags: ruby xml parsing
5 answers
Emotional °昔
Reply #2 · 2019-02-17 14:19

Check out the Ox gem. It is faster than LibXML and Nokogiri and supports in-memory parsing as well as SAX callback parsing. Full disclosure: I wrote it.


The performance comparison at http://www.ohler.com/software/thoughts/Blog/Entries/2011/9/21_XML_with_Ruby.html covers both DOM (in-memory) and SAX (callback) parsers.

太酷不给撩
Reply #3 · 2019-02-17 14:23

Nokogiri is based on libxml2, which is one of the fastest XML/HTML parsers in any language. It is written in C, but there are bindings in many languages.

The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory. Creating a DOM is slower and more memory-hungry than other parsing methods (generally the entire DOM must fit into memory). XPath relies on this DOM.

SAX is often what people turn to for speed or for large documents that don't fit into memory. It is more event driven: it notifies you of a start element, end element, etc, and you write handlers to react to them. It's a bit of a pain because you end up keeping track of state yourself (e.g. which elements you're "inside").

There is a middle ground: some parsers have a "pull parsing" capability where you have a cursor-like navigation. You still visit each node sequentially, but you can "fast-forward" to the end of an element you're not interested in. It's got the speed of SAX but a better interface for many uses. I don't know if Nokogiri can do this for HTML, but I'd look into its Reader API if you're interested.

Note that Nokogiri is also very lenient with malformed markup (such as real-world HTML) and this alone makes it a very good choice for HTML parsing.

我只想做你的唯一
Reply #4 · 2019-02-17 14:25

You may find that for larger XML documents DOM parsing is not very performant. This is because the parser has to build an in-memory map of the structure of the XML document.

The other approach that generally requires a smaller memory footprint is to use an event-driven SAX parser.

Nokogiri has full support for SAX.

迷人小祖宗
Reply #6 · 2019-02-17 14:39

Depending on your environment, Oga may be a better fit: it is a fast-enough XML parser for Ruby with a much nicer interface and a faster installation time.
