Megaparsec: Unable to parse recursive arithmetic s

2019-03-28 04:05发布

问题:

I'm working on a small parser using Megaparsec and trying to parse arithmetic.

-- Arithmetic expressions
data Aexp = N Num 
            | V Var 
            | Mult Aexp Aexp
            | Add Aexp Aexp 
            | Sub Aexp Aexp 
             deriving (Show, Eq, Read)


arithParser :: Parser Aexp
arithParser = V <$> strParser
            <|> N <$> numParser
            <|> Mult <$> arithParser <* tok "*" <*> arithParser
--boolParser :: Parser Bexp


strParser :: Parser Var
strParser = tok "\"" *> some (noneOf ("\n\r\"=[]{},:")) <* tok "\""

numParser :: Parser Num
numParser = (some (oneOf ['0' .. '9']) >>= return . read) <* whitespace

If I run the command Parse arithParser "5*5" "5*5" it just returns Right (N 5), where it should return Mult(N 5) (N 5). Because of the precedence in the arithParser. But if I change the order then it seems to go into an infinite loop and crash.

Not sure what I'm doing wrong here, any help would be appreciated.

回答1:

Parsec tries the left alternative of <|> before it tries the right one. If the left alternative succeeds then it won't bother with the right one. So in this instance, when fed the input 5*5, Parsec's process looks like this:

  1. Try V <$> strParser. strParser begins with tok "\"", but the input string doesn't begin with '"' so strParser fails.
  2. Try N <$> numParser. numParser successfully parses the number 5, so Parsec just returns N 5.
  3. Done! No need to try the third alternative.

So we can attempt to patch this parser up by moving the Mult option up to the top, wrapped in a try so that it can backtrack and try numParser or strParser if the input turns out not to be a multiplication.

arithParser :: Parser Aexp
arithParser = try (Mult <$> arithParser <* tok "*" <*> arithParser)
            <|> N <$> numParser
            <|> V <$> strParser

This parser has another, more subtle problem. Let's walk through the steps, as above.

  1. Try try (Mult <$> arithParser <* tok "*" <*> arithParser). This parser begins with arithParser, so recursively call arithParser.
  2. Try try (Mult <$> arithParser <* tok "*" <*> arithParser). This parser begins with arithParser, so recursively call arithParser.
  3. Try try (Mult <$> arithParser <* tok "*" <*> arithParser). This parser begins with arithParser, so recursively call arithParser.
  4. ...

It's an infinite loop. Parsec can't handle left-recursive grammars. You have to design your parser so that it consumes at least one token before a recursive call. One common way of doing this is to "flatten out" your grammar:

expr, term :: Parser AExp
expr = do
    n <- term
    rest <- optional $ tok "*" *> expr
    return $ maybe n (Mult n) rest
term = N <$> numParser
    <|> V <$> strParser
    <|> parenthesised expr

parenthesised = between (char '(') (char ')')

Here I've split up the parser into one which parses an arbitrary expr - a term optionally followed by a multiplication symbol and a multiplicand expr - and one which parses single terms - numbers, strings, and parenthesised expressions. The recursive calls to expr are OK now - the one inside expr happens only after you've parsed a term (which always consumes input) and the one inside term happens only after you've parsed an opening parenthesis.

Note that expr has a list-like structure: it parses a single thing possibly followed by many things. In general you should think of parsers consuming a linear input stream of input tokens, so it's not surprising that list-shaped parsers tend to be more effective than tree-shaped ones.

The Text.Megaparsec.Expr module contains functions which package up this pattern and parse expressions with arbitrary precedence and fixity rules.

expr = makeExprParser term [[InfixR $ tok "*" $> Mult]]


回答2:

The types are lying to you: when you define a recursive parser p, you're not actually allowed to use p itself wherever you want! You need to munch part of the input first in order to guarantee that you are making progress. Otherwise Haskell will indeed happily go into an infinite loop.

This problem typically gets solved by defining different "tiers" of expressions and only allowing either "simpler" ones or parentheses-wrapped "more complex" ones in left recursive positions (because matching an open parentheses does force you to make your way through part of the input string).

E.g. the grammar for your expressions would be turned into (from simplest to most complex):

<Literal> ::= [0-9]+
<Var>     ::= [a-zA-Z]+
<Base>    ::= '(' <Expr> ')' | <Var> | <Literal>
<Factor>  ::= <Base> | <Base> '*' <Factor>
<Expr>    ::= <Factor> | <Factor> '+' <Expr> | <Factor> '-' <Expr>

This is where a total language shines: because the types have to be completely honest when it comes to termination, it becomes literally impossible to write these badly behaved left recursive parsers. The typechecker tells you that you have to find another way of recognizing the terms of your language.

For instance the fixpoint combinator fix I use in my total parser combinators library doesn't have type (a -> a) -> a but rather (ignoring the funny brackets) (□ a → a) → a which precisely prevents you from using the recursive call before you've made some progress. You can still write a parser for Expr but the typechecker is here to warn you when you're making an illegal move.