-->

Recursive definitions with scala-parser-combinator

2020-02-12 20:03发布

问题:

I have been trying to build a SQL parser with the scala-parser-combinator library, which I've simplified greatly into the code below.

class Expression
case class FalseExpr() extends Expression
case class TrueExpr() extends Expression
case class AndExpression(expr1: Expression, expr2: Expression) extends Expression

object SimpleSqlParser {
  def parse(sql: String): Try[Expression] = new SimpleSqlParser().parse(sql)
}

class SimpleSqlParser extends RegexParsers {
  def parse(sql: String): Try[_ <: Expression] = parseAll(expression, sql) match {
    case Success(matched,_) => scala.util.Success(matched)
    case Failure(msg,remaining) => scala.util.Failure(new Exception("Parser failed: "+msg + "remaining: "+ remaining.source.toString.drop(remaining.offset)))
    case Error(msg,_) => scala.util.Failure(new Exception(msg))
  }

  private def expression: Parser[_ <: Expression] =
    andExpr | falseExpr | trueExpr

  private def falseExpr: Parser[FalseExpr] =
    "false" ^^ (_ => FalseExpr())

  private def trueExpr: Parser[TrueExpr] = "true" ^^ (_ => TrueExpr())

  private def andExpr: Parser[Expression] =
    expression ~ "and" ~ expression ^^ { case e1 ~ and ~ e2 => AndExpression(e1,e2)}
}

Without the 'and' parsing, it works fine. But I want to be able to parse things like 'true AND (false OR true)', for example. When I add the 'and' part to the definition of an expression, I get a StackOverflowError, the stack is alternating between the definitions of 'and' and 'expression'.

I understand why this is happening - the definition of expression begins with and, and vice-versa. But this seems like the most natural way to model this problem. In reality an expression could also be LIKE, EQUALS etc. Is there another way to model this kind of thing in general in order to get around the problem of recursive definitions.

回答1:

scala.util.parsing.combinator.RegexParsers cannot handle left-recursive grammars. Your grammar can be summarized by the following production rules:

expression -> andExpr | falseExpr | trueExpr
...
andExpr -> expression "and" expression

expression is indirectly left-recursive via andExpr.

To avoid the infinite recursion, you need to reformulate the grammar so that it is not left-recursive anymore. One frequently-used way is to use repetition combinators, such as chainl1:

  private def expression: Parser[_ <: Expression] =
    chainl1(falseExpr | trueExpr, "and" ^^^ { AndExpression(_, _) })

Live on Scastie

The new expression matches one or more falseExpr/trueExpr, separated by "and", and combines the matched elements with AndExpression in a left-associative way. Conceptually, it corresponds to the following production rule:

expression -> (falseExpr | trueExpr) ("and" (falseExpr | trueExpr))*

If your grammar contains many tangled left-recursive production rules, you might want to consider other parser combinator libraries, such as GLL combinators, that directly support left recursion.