How to skip whitespace but use it as a token delimiter

Posted 2019-05-27 10:53

Question:

I am trying to build a small parser where the tokens (luckily) never contain whitespace. Whitespace (spaces, tabs and newlines) is essentially the token delimiter (apart from cases where there are brackets etc.).

I am extending the RegexParsers class. If I turn on skipWhitespace, the parser greedily joins tokens together when the next token matches the regular expression of the previous one. If I turn off skipWhitespace, on the other hand, it complains because the spaces are not part of the definition. I am trying to match the BNF as closely as possible, and given that whitespace is almost always the delimiter (apart from brackets or some other cases where the delimiter is explicitly defined in the BNF), is there a way to avoid putting a whitespace regex in all my definitions?

UPDATE

This is a small test example where the tokens are being joined together:

import scala.util.parsing.combinator.RegexParsers

object TestParser extends RegexParsers {
  def test  = "(test" ~> name <~ ")"

  def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}

  def anyChar = letter | digit | "_".r | "-".r
  def letter = """[a-zA-Z]""".r
  def digit = """\d""".r

  def main(args: Array[String]): Unit = {

    val s = "(test hello these should not be joined and I should get an error)"

    val res = parseAll(test, s)
    res match {
      case Success(r, n) => println(r)
      case Failure(msg, n) => println(msg)
      case Error(msg, n) => println(msg)
    }

  }

}

In the above case I just get the words joined together into a single string. Something similar happens if I change test to the following: I expect a list of the separate words after test, but instead the parser joins them and gives me a one-element list containing one long string without the spaces in between:

def test  = "(test" ~> (name+) <~ ")"

Answer 1:

Whitespace is skipped just before every string literal and every regex parser, not only between complete tokens. So, in this snippet:

def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}

It will skip whitespace before the first letter and, even worse, before every single character matched by anyChar*, so the characters of consecutive words all end up in the same name.
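
To see the effect in isolation, here is a minimal sketch. The object name WhitespaceDemo and the sample input are made up for illustration, and it assumes the scala-parser-combinators library is on the classpath. Each single-character regex skips leading whitespace on its own, so the letters of separate words are matched one after another and then glued together:

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical demo object, not part of the original question.
object WhitespaceDemo extends RegexParsers {
  // One regex per character: whitespace is skipped before EACH match.
  def letter: Parser[String] = """[a-zA-Z]""".r

  // Character-level definition, similar in spirit to the question's name rule.
  def name: Parser[String] = rep1(letter) ^^ (_.mkString)

  def main(args: Array[String]): Unit = {
    // The spaces between the words are silently dropped,
    // so the parsed value is the single string "abcdef".
    println(parseAll(name, "ab cd ef").get)
  }
}
```

The three words come back as one string because the parser never sees the spaces as anything other than skippable padding in front of the next letter.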

Use regular expressions (or plain strings) for each token, not each lexical element. Like this:

object TestParser extends RegexParsers {
  def test  = "(test" ~> name <~ ")"
  def name : Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r

  // ...
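
For completeness, a self-contained sketch of the token-level approach, assuming (as in the question) that a name is a letter followed by letters, digits, underscores or hyphens. The object name TokenParser and the names rule are made up here and are not from the original answer. With one regex per token, whitespace is skipped only between tokens, so a single name followed by more words is reported as a failure, and a repetition yields one list element per word:

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical object name; the grammar mirrors the question's example.
object TokenParser extends RegexParsers {
  // The whole token is a single regex, so no whitespace is skipped inside it.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r

  // Exactly one name between the brackets.
  def test: Parser[String] = "(test" ~> name <~ ")"

  // One or more names, separated by the implicitly skipped whitespace.
  def names: Parser[List[String]] = "(test" ~> rep1(name) <~ ")"

  def main(args: Array[String]): Unit = {
    // Two words where one is expected: parsing fails instead of joining them.
    println(parseAll(test, "(test hello world)").successful) // false
    // With rep1, each word is its own list element.
    println(parseAll(names, "(test hello world)").get) // List(hello, world)
  }
}
```

rep1(name) is used instead of the postfix form name+ only so that the sketch compiles without enabling postfix operators; the two are equivalent in the combinator library.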