Parsing an indentation based language using scala

2019-04-25 00:49发布

问题:

Is there a convenient way to use Scala's parser combinators to parse languages where indentation is significant? (e.g. Python)

回答1:

Let's assume we have a very simple language where this is a valid program

block
  inside
  the
  block

and we want to parse this into a List[String] with each line inside the block as one String.

We first define a method that takes a minimum indentation level and returns a parser for a line with that indentation level.

def line(minIndent:Int):Parser[String] = 
  repN(minIndent + 1,"\\s".r) ~ ".*".r ^^ {case s ~ r => s.mkString + r}

Then we define a block with a minimum indentation level by repeating the line parser with a suitable separator between lines.

def lines(minIndent:Int):Parser[List[String]] =
  rep1sep(line(minIndent), "[\n\r]|(\n\r)".r)

Now we can define a parser for our little language like this:

val block:Parser[List[String]] =
  (("\\s*".r <~ "block\\n".r) ^^ { _.size }) >> lines

It first determines the current indentation level and then passes that as the minimum to the lines parser. Let's test it:

val s =
"""block
    inside
    the
    block
outside
the
block"""

println(block(new CharSequenceReader(s)))

And we get

[4.10] parsed: List(    inside,     the,     block)

For all of this to compile, you need these imports

import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.input.CharSequenceReader

And you need to put everything into an object that extends RegexParsers like so

object MyParsers extends RegexParsers {
  override def skipWhitespace = false
  ....


回答2:

From what I know, no, the Scala parser combinators don't have support for this kind of thing out of the box. You can certainly do it by parsing white space in a meaningful way, but you will encounter some problems since you need some form of state machine to keep track of the indentation stack.

I would recommend doing a preprocessing step. Here is a little preprocessor which adds markers to separate indented blocks:

object Preprocessor {

    val BlockStartToken = "{"
    val BlockEndToken = "}"

    val TabSize = 4 //how many spaces does a tab take

    def preProcess(text: String): String = {
        val lines = text.split('\n').toList.filterNot(_.forall(isWhiteChar))
        val processedLines = BlockStartToken :: insertTokens(lines, List(0))
        processedLines.mkString("\n")
    }

    def insertTokens(lines: List[String], stack: List[Int]): List[String] = lines match {
        case List() => List.fill(stack.length) { BlockEndToken } //closing all opened blocks
        case line :: rest => {
            (computeIndentation(line), stack) match {
                case (indentation, top :: stackRest) if indentation > top => {
                    BlockStartToken :: line :: insertTokens(rest,  indentation :: stack)
                }
                case (indentation, top :: stackRest) if indentation == top =>
                    line :: insertTokens(rest, stack)
                case (indentation, top :: stackRest) if indentation < top => {
                    BlockEndToken :: insertTokens(lines, stackRest)
                }
                case _ => throw new IllegalStateException("Invalid algorithm")
            }
        }
    }


    private def computeIndentation(line: String): Int = {
        val whiteSpace = line takeWhile isWhiteChar
        (whiteSpace map {
            case ' ' => 1
            case '\t' => TabSize
        }).sum
    }

    private def isWhiteChar(ch: Char) = ch == ' ' || ch == '\t'
}

Execution for this text gives:

val text =
    """
      |line1
      |line2
      |    line3
      |    line4
      |    line5
      |        line6
      |        line7
      |  line8
      |  line9
      |line10
      |   line11
      |   line12
      |   line13
    """.stripMargin
println(Preprocessor.preProcess(text))

... the following result

{
line1
line2
{
    line3
    line4
    line5
{
        line6
        line7
}
}
{
  line8
  line9
}
line10
{
   line11
   line12
   line13
}
}

And afterwords you can use the combinator library to do the parsing in a simpler fashion.

Hope this helps