I have encounter a problem with parsers having two branches of recursion. To demonstrate the problem easier, I use a simple grammar of a lambda calculus from the article written by Luca Bolognese as the example:
<expression> ::= <name> | <function> | <application> <name> ::= nonblank character sequence <function> ::= \ <name> . <body> <body> ::= <expression> <application> ::= ( <function expression> <argument expression> ) <function expression> ::= <expression> <argument expression> ::= <expression>
The parser in the article is quite concise:
let ws = " \t\n"
let specialChars = ".)(\\\n"
let pWs = spaces
let pName = manyChars (noneOf (ws + specialChars)) |>> EName
let pExpr, pExprRef = createParserForwardedToRef<Expression, Unit>()
let curry2 f a b = f(a,b)
let pFunction = pchar '\\' >>. pipe2 pName (pchar '.' >>. pExpr) (curry2 Function)
let pApplication = pchar '(' >>. pipe2 pExpr (pWs >>. pExpr) (curry2 Application)
.>> pWs .>> pchar ')'
do pExprRef := pFunction <|> pApplication <|> pName
let pExpressions = sepBy pExpr spaces1
let fparseString text =
match run pExpressions text with
| Success(result, _, _) -> result
| Failure(errorMsg, _, _) -> failwith (sprintf "Failure: %s" errorMsg)
I'm interested in pApplication
since they consist of two pExpr
s which in turn could be pApplication
s too. The parser runs out of stack space in the following benchmark:
let generateString level =
let rec loop i =
seq {
if i < level then
yield "("
yield! loop level
yield " "
yield! loop (i+1)
yield ")"
else
yield "(x x)"
}
loop 0 |> String.concat ""
let N = 5000
let s = generateString N;;
let _ = fparseString s;;
How can I rewrite the parser to be tail-recursive?
I recognized the problem when trying to write a parser for a Lisp-like language and test it with real benchmarks. I have Term
and VarBinding
which are mutually recursive types and a let
parser which exhibits the same issue as pApplication
above. My parser is on github in case the analysis is wrong regarding the not tail-recursive problem.
The built-in parser combinators of FParsec generally don't allow for a tail-recursive parser implementation, mainly because of the way the error handling is implemented.
This means that the recursion depth of an FParsec parser is limited by the stack size -- as is the case for most recursive descent parsers. Of course, this doesn't affect the parsing of sequences with
many
,sepBy
,chainl
etc., because these FParsec combinators all have implementations with constant stack space usage.The default stack size in .NET is usually more than enough for parsing human-generated input in well-specified formats with FParsec (assuming you parse sequences with the built-in combinators). However, machine-generated input with a deeply nested structure (like your 5000 level deep s-expressions) can easily lead to a stack overflow.
If you know in advance that the nesting depth in your input is limited to a reasonable value, you could simply increase the stack size to an appropriate value. Hopefully this is true for your benchmark data.
Otherwise you may have to implement a special purpose parser function for the recursive element(s) of your grammar. You would need to implement this function using the low-level API of FParsec and you would obviously have to implement this function such that it uses the heap instead of the stack for keeping track of the nesting context and intermediate parsing results.