Is there a convenient way to use Scala's parser combinators to parse languages where indentation is significant? (e.g. Python)
问题:
回答1:
Let's assume we have a very simple language where this is a valid program
block
inside
the
block
and we want to parse this into a List[String]
with each line inside the block as one String
.
We first define a method that takes a minimum indentation level and returns a parser for a line with that indentation level.
def line(minIndent:Int):Parser[String] =
repN(minIndent + 1,"\\s".r) ~ ".*".r ^^ {case s ~ r => s.mkString + r}
Then we define a block with a minimum indentation level by repeating the line parser with a suitable separator between lines.
def lines(minIndent:Int):Parser[List[String]] =
rep1sep(line(minIndent), "[\n\r]|(\n\r)".r)
Now we can define a parser for our little language like this:
val block:Parser[List[String]] =
(("\\s*".r <~ "block\\n".r) ^^ { _.size }) >> lines
It first determines the current indentation level and then passes that as the minimum to the lines parser. Let's test it:
val s =
"""block
inside
the
block
outside
the
block"""
println(block(new CharSequenceReader(s)))
And we get
[4.10] parsed: List( inside, the, block)
For all of this to compile, you need these imports
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.input.CharSequenceReader
And you need to put everything into an object that extends RegexParsers
like so
object MyParsers extends RegexParsers {
override def skipWhitespace = false
....
回答2:
From what I know, no, the Scala parser combinators don't have support for this kind of thing out of the box. You can certainly do it by parsing white space in a meaningful way, but you will encounter some problems since you need some form of state machine to keep track of the indentation stack.
I would recommend doing a preprocessing step. Here is a little preprocessor which adds markers to separate indented blocks:
object Preprocessor {
val BlockStartToken = "{"
val BlockEndToken = "}"
val TabSize = 4 //how many spaces does a tab take
def preProcess(text: String): String = {
val lines = text.split('\n').toList.filterNot(_.forall(isWhiteChar))
val processedLines = BlockStartToken :: insertTokens(lines, List(0))
processedLines.mkString("\n")
}
def insertTokens(lines: List[String], stack: List[Int]): List[String] = lines match {
case List() => List.fill(stack.length) { BlockEndToken } //closing all opened blocks
case line :: rest => {
(computeIndentation(line), stack) match {
case (indentation, top :: stackRest) if indentation > top => {
BlockStartToken :: line :: insertTokens(rest, indentation :: stack)
}
case (indentation, top :: stackRest) if indentation == top =>
line :: insertTokens(rest, stack)
case (indentation, top :: stackRest) if indentation < top => {
BlockEndToken :: insertTokens(lines, stackRest)
}
case _ => throw new IllegalStateException("Invalid algorithm")
}
}
}
private def computeIndentation(line: String): Int = {
val whiteSpace = line takeWhile isWhiteChar
(whiteSpace map {
case ' ' => 1
case '\t' => TabSize
}).sum
}
private def isWhiteChar(ch: Char) = ch == ' ' || ch == '\t'
}
Execution for this text gives:
val text =
"""
|line1
|line2
| line3
| line4
| line5
| line6
| line7
| line8
| line9
|line10
| line11
| line12
| line13
""".stripMargin
println(Preprocessor.preProcess(text))
... the following result
{
line1
line2
{
line3
line4
line5
{
line6
line7
}
}
{
line8
line9
}
line10
{
line11
line12
line13
}
}
And afterwords you can use the combinator library to do the parsing in a simpler fashion.
Hope this helps