I am trying to build a small parser where the tokens (luckily) never contain whitespace. Whitespace (spaces, tabs and newlines) is essentially the token delimiter (apart from cases where there are brackets etc.).
I am extending the RegexParsers class. If I turn on skipWhitespace, the parser greedily joins tokens together when the next token matches the regular expression of the previous one. If I turn off skipWhitespace, on the other hand, it complains because the spaces are not part of the definition. I am trying to match the BNF as closely as possible, and given that whitespace is almost always the delimiter (apart from brackets or some other cases where the delimiter is explicitly defined in the BNF), is there a way to avoid putting a whitespace regex in all my definitions?
UPDATE
This is a small test example where the tokens are being joined together:
import scala.util.parsing.combinator.RegexParsers

object TestParser extends RegexParsers {
  def test = "(test" ~> name <~ ")"
  def name : Parser[String] = (letter ~ (anyChar*)) ^^ { case first ~ rest => (first :: rest).mkString}
  def anyChar = letter | digit | "_".r | "-".r
  def letter = """[a-zA-Z]""".r
  def digit = """\d""".r

  def main(args: Array[String]) {
    val s = "(test hello these should not be joined and I should get an error)"
    val res = parseAll(test, s)
    res match {
      case Success(r, n) => println(r)
      case Failure(msg, n) => println(msg)
      case Error(msg, n) => println(msg)
    }
  }
}
In the above case I just get the string joined together.
A similar effect occurs if I change test to the following. I expect it to give me the list of separate words after test, but instead it joins them together and gives me a one-element list containing one long string, without the spaces in between:
def test = "(test" ~> (name+) <~ ")"
Whitespace is skipped just before every production rule that is a string or a regex. So, in the name rule above, it will skip whitespace before each letter and, even worse, before each empty string for good measure (since anyChar* can be empty).
Use regular expressions (or plain strings) for each token, not for each lexical element. Like this:
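As a minimal sketch of that advice (not necessarily the exact code originally posted here), the whole name token can be written as a single regex, so whitespace is skipped only once before the token and never between its characters. The names TokenParser, single and many are hypothetical and chosen for illustration; the character class simply combines the letter, digit, "_" and "-" rules from the question.

```scala
import scala.util.parsing.combinator.RegexParsers

object TokenParser extends RegexParsers {
  // One regex for the whole token: a letter followed by letters,
  // digits, underscores or hyphens. skipWhitespace stays at its
  // default (true), so whitespace is skipped before the token only.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r

  // Exactly one name between "(test" and ")".
  def single: Parser[String] = "(test" ~> name <~ ")"

  // One or more whitespace-separated names.
  def many: Parser[List[String]] = "(test" ~> rep1(name) <~ ")"

  def main(args: Array[String]): Unit = {
    // Succeeds with the single name "hello".
    println(parseAll(single, "(test hello)"))
    // Fails: a second word appears where ")" is expected.
    println(parseAll(single, "(test hello world)"))
    // Succeeds with two separate names instead of one joined string.
    println(parseAll(many, "(test hello world)"))
  }
}
```

With name defined this way, the input from the question is rejected by single because the words are no longer merged, and many returns each word as its own list element.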