I've noticed a distinct lack of LL parsers that create parsers in functional languages. The ideal find for what I've been looking for without success is something to generate a Haskell parser for an ANTLR-style LL(*) grammar (modulo minor reformatting of the grammar), and was surprised that every last parser generator with a functional language target I found was some kind of LR parser.
I want to transition the parser of this language I'm working on which has functional features from ANTLR to self-host in the language itself, and it would help a lot if I could port to my language something almost surely correct in another functional language (preferably ones I'm familiar with, Haskell and Scala), instead of having to rewrite it entirely from scratch, though in the end I might do this, since the core language is small.
At this point more than even a solution to this is I'm very curious as to why there are no such LL(*) or even LL(k) parser generators, but many LR generators, since LL seems inherently easier.
The major reason for this is that most LL(k) parsers that are written in functional languages are just implemented using parser combinators, because the easiest path to generate a parser combinator library is recursive descent.
Haskell's parsec, attoparsec, and polyparse and Scala's stock parser combinators all produce what are effectively LL(*) parsers.
Both parsec and attoparsec require you to use an explicit try combinator to get backtracking, but this is only done for efficiency and the scala parser combinators can also deal with packrat parsing.
Consider the following fragment from the announcement of Brent Yorgey's recent unbound package:
parseAtom = parens parseTerm
<|> var <$> ident
<|> lam <$> brackets ident <*> parseTerm
it is pretty easy to see the original grammar.
LR parsers require much more complicated preprocessing to generate the tables to execute efficiently, since the direct hand encoding of one using something like recursive ascent is pretty awful.
By implementing your parser combinators as an EDSL rather than an external tool you enable greater use of advanced features of your programming language. You can make portions of the grammar higher order, build the lexer hack directly into the parser, etc. Typical LR parser generators can't do these things, or can only offer them in ad hoc ways in limited contexts because of the need to be able to emit the tables in the end.
You've inspired me to post an old hobby project to https://github.com/nrnrnr/ebnf. It supports LL(1) parser generation for Standard ML. Adapting to Haskell would not be hard, provided you can find someone who understands Icon.
Edward's comment that people tend to prefer combinators for parsing is quite correct, but there are advantages to a tool that will find FIRST and FOLLOW sets and will complain about ambiguity.
The primary advantage of EBNF is its notation: sequences, Maybe
types, and reserved words are all supported natively, with no extra fuss.
SML has had ml-antlr for a couple of years now:
http://www.classes.cs.uchicago.edu/archive/2007/winter/22610-1/docs/lpt-manual.pdf
There is also sllgen for Scheme.
As to why there are many more LR parser generators than LL ones - it's difficult to write LR parsers by hand, so you really need a generator. With LL parsers, a hand-coded implementation still matches the grammar, so there is much less need for a generator.
With Scala you could use all the existing Java tools without much overhead. JavaCC is an LL(k) parser generator. You can use it to create a concrete syntax tree automatically and do everything else in Scala. I actually did that for a small project, simply because the JavaCC grammar already existed.
Before I describe a LL(k) solution under development, I will explain why I didn't use the other available options that I am aware of.
Parser combinator libraries, i.e. recursive descent parsers, do not:
- guarantee a context-free grammar (CFG), because they do not calculate the first-sets and follow-sets.
- do efficient table-driven k tokens of look-ahead. Instead they do inefficient back-tracing.
- do not do the more efficient stack based parsing algorithm.
A lack of context-freedom is manifested as ambiguities in the syntax, e.g. whether an operator at the start of a source code line (following a line that did not send with a semicolon) is a prefix unary on the expression to its right side, or an infix binary operator on the expression at the end of the prior source code line.
JavaCC has the following disadvantages:
- conflates the lexer and the parser generation.
- overly verbose BNF grammar syntax.
- outputs Java and I want Scala (eventually Copute).
- afaics doesn't do a tight coupling of names in grammar and in the AST.
- afaics monolithic source code, e.g. can't (easily) extract modules to generate first and follow-sets (so that I could plugin my own parser generation).
I am attempting to create a LL(k) parser generator that would output to Scala, and then hopefully bootstrap to output the language I am developing (Copute, which will compile to Scala).
My current attempt is using a subset of the SLK grammar syntax, so the SLK tool can be used to verify the grammar is context-free.
You can use ANTLR which generates LL* parsers in Java (amongst others), hence .class
resp .jar
files.