Here is the (simplified) EBNF section I'm trying to implement in PetitParser:
variable :: component / identifier
component :: indexed / field
indexed :: variable , $[ , blah , $]
field :: variable , $. , identifier
What I did was to add all these productions (except identifier
) as ivars of my subclass of PPCompositeParser
and define the corresponding methods as follows:
variable
^component / self identifier
component
^indexed / field
identifier
^(#letter asParser, (#word asParser) star) flatten
indexed
^variable , $[ asParser, #digit asParser, $] asParser
field
^variable , $. asParser, self identifier
start
^variable
Finally, I created a new instance of my parser and sent to it the message parse: 'a.b[0]'
.
The problem: I get a stack overflow.
The grammar has a left recursion:
variable -> component -> indexed -> variable
. PetitParser uses Parsing Expression Grammars (PEGs) that cannot handle left recursion. A PEG parser always takes the left option until it finds a match. In this case it will not find a match due to the left recursion. To make it work you need to first eliminate left recursion. Eliminating all left recursion could be more tricky as you will also get one throughfield
after eliminating the first. For example, you can write the grammar as follows to make the left recursion more obvious:If you have a left recursion like:
you can eliminate it like (e is an empty parser)
You'll need to apply this twice to get rid of the recursion. Alternatively you can choose to simplify the grammar if you do not want to parse all possible combinations of identifiers.
The problem is that your grammar is left recursive. PetitParser uses a top-down greedy algorithm to parse the input string. If you follow the steps, you'll see that it goes from
start
thenvariable -> component -> indexed -> variable
. This is becomes a loop that gets executed infinitely without consuming any input, and is the reason of the stack overflow (that is the left-recursiveness in practice).The trick to solve the situation is to rewrite the parser by adding intermediate steps to avoid left-recursing. The basic idea is that the rewritten version will consume at least one character in each cycle. Let's start by simplifying a bit the parser refactoring the non-recursive parts of ´indexed´ and ´field´, and moving them to the bottom.
Now you can more easily see (by following the loop) that if the recursion in
variable
is to end, an identifier has to be found at the beginning. That's the only way to start, and then comes more input (or ends). Let's call that second partvariable'
:now the
variable'
actually refers to something with the identifier consumed, and we can safely move the recusion from the left ofindexed
andfield
to the right invariable'
:I've written this answer without actually testing the code, but should be okish. The parser can be further simplified, I leave that as an excercise ;).
For more information on left-recursion elimination you can have a look at left recursion elimination