Performance of parsers: PEG vs LALR(1) or LL(k)

2019-03-18 13:39发布

问题:

I've seen some claims that optimized PEG parsers in general cannot be faster than optimized LALR(1) or LL(k) parsers. (Of course, performance of parsing would depend on a particular grammar.)

I'd like to know if there are any specific limitations of PEG parsers, either valid in general or for some subsets of PEG grammars that would make them inferior to LALR(1) or LL(k) performance-wise.

In particular, I'm interested in parser generators, but assume that their output can be tweaked for performance in any particular case. I also assume that parsers are optimized and it is possible to tweak a particular grammar a bit if that's needed to improve performance.

回答1:

Found a good answer about Packrat vs LALR parsing. Some quotes from it:

L(AL)R parsers are linear time parsers, too. So in theory, neither packrat nor L(AL)R parsers are "faster".

What matters, in practice, of course, is implementation. L(AL)R state transitions can be executed in very few machine instructions ("look token code up in vector, get next state and action") so they can be extremely fast in practice.

An observation: most language front-ends don't spend most of their time "parsing"; rather, they spend a lot of time in lexical analysis. Optimize that ..., and the parser speed won't matter much.



回答2:

PEG parsers can use unlimited lookahead (while maintaining linear parse time on average, via packrat) unlike (default) LL(k), or LR(k) parsers which use limited lookahead, while maintining linear parse time.

Lately (2014-2015) ANTLR4 has made extensions to handle arbitrary lookahead (as in PEG) while maintaining linear parse time on average (said to be more efficient than packrat algorithm), however this is incorporates new extensions and variations of the LR parsing algorithm (and not the default LR algorithm).

The packrat parser (and associated parsers for LL, LR) is not necesarily practical, but provides theoretical bounds on parsing so comparison can be made.

But note that unlimited lookahead can be used to parse grammars/languages in linear time (e.g via packrat or antlr) which are not possible to parse via LL(k) or LR(k) even in non-linear time, So it is important to understand what is compared to what.