Threading extra state through a parser in Scala

2019-03-09 17:55发布

I'll give you the tl;dr up front

I'm trying to use the state monad transformer in Scalaz 7 to thread extra state through a parser, and I'm having trouble doing anything useful without writing a lot of t m a -> t m b versions of m a -> m b methods.

An example parsing problem

Suppose I have a string containing nested parentheses with digits inside them:

val input = "((617)((0)(32)))"

I also have a stream of fresh variable names (characters, in this case):

val names = Stream('a' to 'z': _*)

I want to pull a name off the top of the stream and assign it to each parenthetical expression as I parse it, and then map that name to a string representing the contents of the parentheses, with the nested parenthetical expressions (if any) replaced by their names.

To make this more concrete, here's what I'd want the output to look like for the example input above:

val target = Map(
  'a' -> "617",
  'b' -> "0",
  'c' -> "32",
  'd' -> "bc",
  'e' -> "ad"
)

There may be either a string of digits or arbitrarily many sub-expressions at a given level, but these two kinds of content won't be mixed in a single parenthetical expression.

To keep things simple, we'll assume that the stream of names will never contain either duplicates or digits, and that it will always contain enough names for our input.

Using parser combinators with a bit of mutable state

The example above is a slightly simplified version of the parsing problem in this Stack Overflow question. I answered that question with a solution that looked roughly like this:

import scala.util.parsing.combinator._

class ParenParser(names: Iterator[Char]) extends RegexParsers {
  def paren: Parser[List[(Char, String)]] = "(" ~> contents <~ ")" ^^ {
    case (s, m) => (names.next -> s) :: m
  }

  def contents: Parser[(String, List[(Char, String)])] = 
    "\\d+".r ^^ (_ -> Nil) | rep1(paren) ^^ (
      ps => ps.map(_.head._1).mkString -> ps.flatten
    )

  def parse(s: String) = parseAll(paren, s).map(_.toMap)
}

It's not too bad, but I'd prefer to avoid the mutable state.

What I want

Haskell's Parsec library makes adding user state to a parser trivially easy:

import Control.Applicative ((*>), (<$>), (<*))
import Data.Map (fromList)
import Text.Parsec

paren = do
  (s, m) <- char '(' *> contents <* char ')'
  h : t  <- getState
  putState t
  return $ (h, s) : m
  where
    contents
      =  flip (,) []
     <$> many1 digit
     <|> (\ps -> (map (fst . head) ps, concat ps))
     <$> many1 paren

main = print $
  runParser (fromList <$> paren) ['a'..'z'] "example" "((617)((0)(32)))"

This is a fairly straightforward translation of my Scala parser above, but without mutable state.

What I've tried

I'm trying to get as close to the Parsec solution as I can using Scalaz's state monad transformer, so instead of Parser[A] I'm working with StateT[Parser, Stream[Char], A]. I have a "solution" that allows me to write the following:

import scala.util.parsing.combinator._
import scalaz._, Scalaz._

object ParenParser extends ExtraStateParsers[Stream[Char]] with RegexParsers {
  protected implicit def monadInstance = parserMonad(this)

  def paren: ESP[List[(Char, String)]] = 
    (lift("(" ) ~> contents <~ lift(")")).flatMap {
      case (s, m) => get.flatMap(
        names => put(names.tail).map(_ => (names.head -> s) :: m)
      )
    }

  def contents: ESP[(String, List[(Char, String)])] =
    lift("\\d+".r ^^ (_ -> Nil)) | rep1(paren).map(
      ps => ps.map(_.head._1).mkString -> ps.flatten
    )

  def parse(s: String, names: Stream[Char]) =
    parseAll(paren.eval(names), s).map(_.toMap)
}

This works, and it's not that much less concise than either the mutable state version or the Parsec version.

But my ExtraStateParsers is ugly as sin—I don't want to try your patience more than I already have, so I won't include it here (although here's a link, if you really want it). I've had to write new versions of every Parser and Parsers method I use above for my ExtraStateParsers and ESP types (rep1, ~>, <~, and |, in case you're counting). If I had needed to use other combinators, I'd have had to write new state transformer-level versions of them as well.

Is there a cleaner way to do this? I'd love to see an example of a Scalaz 7's state monad transformer being used to thread state through a parser, but Scalaz 6 or Haskell examples would also be useful and appreciated.

1条回答
贼婆χ
2楼-- · 2019-03-09 18:20

Probably the most general solution would be to rewrite Scala's parser library to accommodate monadic computations while parsing (like you partly did), but that would be quite a laborious task.

I suggest a solution using ScalaZ's State where each of our result isn't a value of type Parse[X], but a value of type Parse[State[Stream[Char],X]] (aliased as ParserS[X]). So the overall parsed result isn't a value, but a monadic state value, which is then run on some Stream[Char]. This is almost a monad transformer, but we have to do lifting/unlifting manually. It makes the code a bit uglier, as we need to lift values sometimes or use map/flatMap on several places, but I believe it's still reasonable.

import scala.util.parsing.combinator._
import scalaz._
import Scalaz._
import Traverse._

object ParenParser extends RegexParsers with States {
  type S[X] = State[Stream[Char],X];
  type ParserS[X] = Parser[S[X]];


  // Haskell's `return` for States
  def toState[S,X](x: X): State[S,X] = gets(_ => x)

  // Haskell's `mapM` for State
  def mapM[S,X](l: List[State[S,X]]): State[S,List[X]] =
    l.traverse[({type L[Y] = State[S,Y]})#L,X](identity _);

  // .................................................

  // Read the next character from the stream inside the state
  // and update the state to the stream's tail.
  def next: S[Char] = state(s => (s.tail, s.head));


  def paren: ParserS[List[(Char, String)]] =
    "(" ~> contents <~ ")" ^^ (_ flatMap {
      case (s, m) => next map (v => (v -> s) :: m)
    })


  def contents: ParserS[(String, List[(Char, String)])] = digits | parens;
  def digits: ParserS[(String, List[(Char, String)])] =
    "\\d+".r ^^ (_ -> Nil) ^^ (toState _)
  def parens: ParserS[(String, List[(Char, String)])] =
    rep1(paren) ^^ (mapM _) ^^ (_.map(
        ps => ps.map(_.head._1).mkString -> ps.flatten
      ))


  def parse(s: String): ParseResult[S[Map[Char,String]]] =
    parseAll(paren, s).map(_.map(_.toMap))

  def parse(s: String, names: Stream[Char]): ParseResult[Map[Char,String]] =
    parse(s).map(_ ! names);
}

object ParenParserTest extends App {
  {
    println(ParenParser.parse("((617)((0)(32)))", Stream('a' to 'z': _*)));
  }
}

Note: I believe that your approach with StateT[Parser, Stream[Char], _] isn't conceptually correct. The type says that we're constructing a parser given some state (a stream of names). So it would be possible that given different streams we get different parsers. This is not what we want to do. We only want that the result of parsing depends on the names, not the whole parser. In this way Parser[State[Stream[Char],_]] seems to be more appropriate (Haskell's Parsec takes a similar approach, the state/monad is inside the parser).

查看更多
登录 后发表回答