Using parser combinators to collate lines of text

Posted 2019-07-20 10:47

Question:

I'm trying to parse a text file using parser combinators. I want to capture each index and its text in a class called Example. Here's a test showing the form of the input file:

object Test extends ParsComb with App {
  val input = """
0)
blah1
blah2
blah3
1)
blah4
blah5
END
"""
  println(parseAll(examples, input))
}

And here's my attempt that doesn't work:

import scala.util.parsing.combinator.RegexParsers

case class Example(index: Int, text: String)

class ParsComb extends RegexParsers {
  def examples: Parser[List[Example]] = rep(divider~example) ^^ 
                                          {_ map {case d ~ e => Example(d,e)}}
  def divider:  Parser[Int]           = "[0-9]+".r <~ ")"    ^^ (_.toInt)
  def example:  Parser[String]        = ".*".r <~ (divider | "END") 
}

It fails with:

[4.1] failure: `END' expected but `b' found

blah2

^

I'm just starting out with these so I don't have much clue what I'm doing. I think the problem could be with the ".*".r regex not doing multi-line. How can I change this so that it parses correctly?

Answer 1:

  • What does the error message mean?

In your grammar, ".*".r <~ (divider | "END") tells the parser that an example must be followed by either a divider or the literal END. Because . does not match line terminators, ".*".r consumes only blah1. The parser then tries divider, which fails, and then "END", which also fails. "END" is the last alternative it tried, so the message reports that END was expected, while the next input, at line 4, column 1, is the b of blah2.
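The behaviour of ".*".r can be checked outside the parser. The sketch below uses only scala.util.matching.Regex from the standard library; DotDemo and firstMatch are hypothetical names chosen for illustration, and the prefix match is meant to approximate how a regex parser matches at the current position, not to reproduce the library's internals.

```scala
import scala.util.matching.Regex

object DotDemo {
  // The same regex as the `example` parser in the question.
  val anyChars: Regex = ".*".r

  // Match the regex at the start of the input only (a prefix match).
  def firstMatch(input: String): Option[String] = anyChars.findPrefixOf(input)

  def main(args: Array[String]): Unit = {
    // By default `.` does not match line terminators,
    // so the match ends at the first newline.
    println(firstMatch("blah1\nblah2\nblah3\n")) // Some(blah1)
  }
}
```

Only blah1 is matched, which is why the parser then looks for a divider or END and finds blah2 instead.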

  • How to fix it?

Staying close to your implementation, the grammar for your input should be:

examples ::= {divider example}
divider  ::= Integer")"
example  ::= {literal ["END"]}

I think parsing example into a List[String] makes more sense, but that's up to you.

The problem is your example parser: it should be a repetition of literals, one per line.

So:

class ParsComb extends RegexParsers {
  def examples: Parser[List[Example]] = rep(divider ~ example) ^^ { _ map { case d ~ e => Example(d, e) } }
  def divider: Parser[Int] = "[0-9]+".r <~ ")" ^^ (_.toInt)
  def example: Parser[List[String]] = rep("[\\w]*(?=[\\r\\n])".r <~ opt("END"))
}

In the regex, (?=[\\r\\n]) is a positive lookahead: [\\w]* matches only when the word characters are immediately followed by \r or \n, and the line terminator itself is not consumed.
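The lookahead can be tried on its own with the standard library's Regex class. This is a minimal sketch; LookaheadDemo and prefix are hypothetical names, and the inputs are fragments of the question's test data.

```scala
import scala.util.matching.Regex

object LookaheadDemo {
  // Word characters that must be followed by \r or \n (which is not consumed).
  val line: Regex = "[\\w]*(?=[\\r\\n])".r

  // Match the regex at the start of the input only.
  def prefix(input: String): Option[String] = line.findPrefixOf(input)

  def main(args: Array[String]): Unit = {
    // A plain text line is matched up to, but not including, the newline.
    println(prefix("blah1\nblah2\n")) // Some(blah1)
    // A divider line is rejected: "1" is followed by ")", not by a newline.
    println(prefix("1)\nblah4\n"))    // None
  }
}
```

Because the divider line is rejected, rep stops there and the next divider is left for the examples parser to consume.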

The parse result is:

[10.1] parsed: List(Example(0,List(blah1, blah2, blah3)), Example(1,List(blah4, blah5)))

If you want a String instead of a List[String], add a transformation to the example parser, for example: ^^ {_ mkString "\n"}



Answer 2:

Your parser doesn't handle newline characters, your example parser consumes the next divider, and your example regex also matches divider lines and the "END" string.

Try this:

object ParsComb extends RegexParsers { 
  def examples: Parser[List[Example]] = rep(divider~example) <~ """END\n?""".r ^^ {_ map {case d ~ e => Example(d,e)}} 
  def divider: Parser[Int] = "[0-9]+".r <~ ")\n" ^^ (_.toInt) 
  def example: Parser[String] = rep(str) ^^ {_.mkString}
  def str: Parser[String] = """.*\n""".r ^? { case s if simpleLine(s) => s}

  val div = """[0-9]+\)\n""".r
  def simpleLine(s: String) = s match {
    case div() => false
    case "END\n" => false
    case _ => true
  }

  def apply(s: String) = parseAll(examples, s)
}

Result:

scala> ParsComb(input)
res3: ParsComb.ParseResult[List[Example]] =
[10.1] parsed: List(Example(0,blah1
blah2
blah3
), Example(1,blah4
blah5
))
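The line filter in this answer relies on a Regex used as a pattern, which matches only when the whole string matches. The sketch below isolates that filter using just the standard library; LineFilterDemo and isTextLine are hypothetical names, and the sample lines are made up for illustration.

```scala
import scala.util.matching.Regex

object LineFilterDemo {
  // A divider line: digits, a closing parenthesis, then a newline.
  val divider: Regex = """[0-9]+\)\n""".r

  // True for ordinary text lines, false for divider lines and the END marker.
  def isTextLine(s: String): Boolean = s match {
    case divider() => false // the pattern must match the entire string
    case "END\n"   => false
    case _         => true
  }

  def main(args: Array[String]): Unit = {
    val lines = List("0)\n", "blah1\n", "blah2\n", "END\n")
    println(lines.filter(isTextLine)) // keeps only the two blah lines
  }
}
```

Since the pattern has to match the entire line, a text line that merely starts with something like "10) " is still treated as text.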


Answer 3:

I think the problem could be with the ".*".r regex not doing multi-line.

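Exactly. Use the dotall modifier (strangely called "s") so that . also matches line terminators. Note that with this change alone a greedy .* runs to the end of the input, so the pattern still needs to stop before the next divider or END for the rest of the parser to succeed: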

def example:  Parser[String]        = "(?s).*".r <~ (divider | "END")
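The effect of the modifier can be checked with the standard library alone. In this sketch, DotAllDemo, greedy and bounded are hypothetical names, and the bounded pattern with a lookahead is only one possible way to stop before the next divider or END; it is an assumption for illustration, not something taken from the answers above.

```scala
import scala.util.matching.Regex

object DotAllDemo {
  // Dotall mode: `.` also matches line terminators, and `.*` is greedy.
  val greedy: Regex = "(?s).*".r

  // Assumption: a reluctant match that stops before a newline followed by
  // a divider ("digits)") or by END.
  val bounded: Regex = "(?s).*?(?=\\n[0-9]+\\)|\\nEND)".r

  def main(args: Array[String]): Unit = {
    val input = "blah1\nblah2\n1)\nblah4\nEND\n"
    // The greedy version swallows the divider and END along with the text.
    println(greedy.findPrefixOf(input) == Some(input)) // true
    // The bounded version stops before the line containing "1)".
    println(bounded.findPrefixOf(input) == Some("blah1\nblah2")) // true
  }
}
```

With the greedy pattern nothing is left for divider | "END" to match, which is why the pattern has to be bounded in some way.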