Recursive descent parser implementation

2019-02-03 06:46发布

I am looking to write some pseudo-code of a recursive descent parser. Now, I have no experience with this type of coding. I have read some examples online but they only work on grammar that uses mathematical expressions. Here is the grammar I am basing the parser on.

S -> if E then S | if E then S else S | begin S L | print E

L -> end | ; S L

E -> i

I must write methods S(), L() and E() and return some error messages, but the tutorials I have found online have not helped a lot. Can anyone point me in the right direction and give me some examples?.

I would like to write it in C# or Java syntax, since it is easier for me to relate.


Update

public void S() {
    if (currentToken == "if") {
        getNextToken();
        E();

        if (currentToken == "then") {
            getNextToken();
            S();

            if (currentToken == "else") {
                getNextToken();
                S();
                Return;
            }
        } else {
            throw new IllegalTokenException("Procedure S() expected a 'then' token " + "but received: " + currentToken);
        } else if (currentToken == "begin") {
            getNextToken();
            S();
            L();
            return;
        } else if (currentToken == "print") {
            getNextToken();
            E();
            return;
        } else {
            throw new IllegalTokenException("Procedure S() expected an 'if' or 'then' or else or begin or print  token " + "but received: " + currentToken);
        }
    }
}


public void L() {
    if (currentToken == "end") {
        getNextToken();
        return;
    } else if (currentToken == ";") {
        getNextToken();
        S();
        L();
        return;
    } else {
        throw new IllegalTokenException("Procedure L() expected an 'end' or ';' token " + "but received: " + currentToken);
    }
}


public void E() {
    if (currentToken == "i") {
        getNextToken();
        return;
    } else {
        throw new IllegalTokenException("Procedure E() expected an 'i' token " + "but received: " + currentToken);
    }
}

3条回答
Root(大扎)
2楼-- · 2019-02-03 07:12

This is not the easiest grammar to start with because you have an unlimited amount of lookahead on your first production rule:

S -> if E then S | if E then S else S |  begin S L | print E

consider

if 5 then begin begin begin begin ...

When do we determine this stupid else?

also, consider

if 5 then if 4 then if 3 then if 2 then print 2 else ...

Now, was that else supposed to bind to the if 5 then fragment? If not, that's actually cool, but be explicit.

You can rewrite your grammar (possibly, depending on else rule) equivalently as:

S -> if E then S (else S)? | begin S L | print E
L -> end | ; S L
E -> i

Which may or may not be what you want. But the pseudocode sort of jumps out from this.

define S() {
  if (peek()=="if") {
    consume("if")
    E()
    consume("then")
    S()
    if (peek()=="else") {
      consume("else")
      S()
    }
  } else if (peek()=="begin") {
    consume("begin")
    S()
    L()
  } else if (peek()=="print") {
    consume("print")
    E()
  } else {
    throw error()
  }
}

define L() {
  if (peek()=="end") {
    consume("end")
  } else if (peek==";")
    consume(";")
    S()
    L()
  } else {
    throw error()
  }
}

define E() {
  consume_token_i()
}

For each alternate, I created an if statement that looked at the unique prefix. The final else on any match attempt is always an error. I consume keywords and call the functions corresponding to production rules as I encounter them.

Translating from pseudocode to real code isn't too complicated, but it isn't trivial. Those peeks and consumes probably don't actually operate on strings. It's far easier to operate on tokens. And simply walking a sentence and consuming it isn't quite the same as parsing it. You'll want to do something as you consume the pieces, possibly building up a parse tree (which means these functions probably return something). And throwing an error might be correct at a high level, but you'd want to put some meaningful information into the error. Also, things get more complex if you do require lookahead.

I would recommend Language Implementation Patterns by Terence Parr (the guy who wrote antlr, a recursive descent parser generator) when looking at these kinds of problems. The Dragon Book (Aho, et al, recommended in a comment) is good, too (it is still probably the dominant textbook in compiler courses).

查看更多
Bombasti
3楼-- · 2019-02-03 07:21

I taught (really just, helped) the parsing section of a PL class last semester. I really recommend looking at the parsing slides from our page: here. Basically, for recursive descent parsing, you ask yourself the following question:

I've parsed a little bit of a nonterminal, now I'm at a point where I can make a choice as to what I'm supposed to parse next. What I see next will make decisions as to what nonterminal I'm in.

Your grammar -- by the way -- exhibits a very common ambiguity called the "dangling else," which has been around since the Algol days. In shift reduce parsers (typically produced by parser generators) this generates a shift / reduce conflict, where you typically choose to arbitrarily shift instead of reduce, giving you the common "maximal much" principal. (So, if you see "if (b) then if (b2) S1 else S2" you read it as "if (b) then { if (b2) { s1; } else { s2; } }")

Let's throw this out of your grammar, and work with a slightly simpler grammar:

T -> A + T
 |   A - T
 |   A
A -> NUM * A
   | NUM / A
   | NUM

we're also going to assume that NUM is something you get from the lexer (i.e., it's just a token). This grammar is LL(1), that is, you can parse it with a recursive descent parser implemented using a naive algorithm. The algorithm works like this:

parse_T() {
  a = parse_A();
  if (next_token() == '+') {
    next_token();  // advance to the next token in the stream
    t = parse_T();
    return T_node_plus_constructor(a, t);
  }
  else if (next_token() == '-') {
    next_token();  // advance to the next token in the stream
    t = parse_T();
    return T_node_minus_constructor(a, t);
  } else {
    return T_node_a_constructor(a);
  }
}
parse_A() {
  num = token(); // gets the current token from the token stream
  next_token();  // advance to the next token in the stream
  assert(token_type(a) == NUM);
  if (next_token() == '*') {
    a = parse_A();
    return A_node_times_constructor(num, a);
  }
  else if (next_token() == '/') {
    a = parse_A();
    return A_node_div_constructor(num, a);
  } else {
    return A_node_num_constructor(num);
  }
}

Do you see the pattern here:

for each nonterminal in your grammar: parse the first thing, then see what you have to look at to decide what you should parse next. Continue this until you're done!

Also, please note that this type of parsing typically won't produce what you want for arithmetic. Recursive descent parsers (unless you use a little trick with tail recursion?) won't produce leftmost derivations. In particular, you can't write left recursive rules like "a -> a - a" where the leftmost associativity is really necessary! This is why people typically use fancier parser generator tools. But the recursive descent trick is still worth knowing about and playing around with.

查看更多
爱情/是我丢掉的垃圾
4楼-- · 2019-02-03 07:26

Basically in recursive descent parsing each non-terminal in the grammar is translated into a procedure, then inside each procedure you check to see if the current token you are looking at matches what you would expect to see on the right hand side of the non-terminal symbol corresponding to the procedure, if it does then continue applying the production, if it doesn't then you have encountered an error and must take some action.

So in your case as you mentioned above you will have procedures: S(), L(), and E(), I will give an example of how L() could be implemented and then you can try and do S() and E() on your own.

It is also important to note that you will need some other program to tokenize the input for you, then you can just ask that program for the next token from your input.

/**
 * This procedure corresponds to the L non-terminal
 * L -> 'end'
 * L -> ';' S L
 */ 
public void L()
{
   if(currentToken == 'end')
   {
      //we found an 'end' token, get the next token in the input stream
      //Notice, there are no other non-terminals or terminals after 
      //'end' in this production so all we do now is return
      //Note: that here we return to the procedure that called L()
      getNextToken();
      return; 
   } 
   else if(currentToken == ';')
   {
      //we found an ';', get the next token in the input stream
      getNextToken();
      //Notice that there are two more non-terminal symbols after ';'
      //in this production, S and L; all we have to do is call those
      //procedures in the order they appear in the production, then return
      S();
      L();
      return;
   }
   else
   {
      //The currentToken is not either an 'end' token or a ';' token 
      //This is an error, raise some exception
      throw new IllegalTokenException(
          "Procedure L() expected an 'end' or ';' token "+
          "but received: " + currentToken);
   }
}

Now you try S() and E(), and post back.

Also as Kristopher points out your grammar has something called a dangling else, meaing that you have a production that starts with the same thing up to a point:

S -> if E then S 
S -> if E then S else S

So this begs the question if your parser sees an 'if' token, which production should it choose to process the input? The answer is it has no idea which one to choose because unlike humans the compiler can't look ahead into the input stream to search for an 'else' token. This is a simple problem to fix by applying a rule known as Left-Factoring, very similar to how you would factor an algebra problem.

All you have to do is create a new non-terminal symbol S' (S-prime) whose right hand side will hold the pieces of the productions that aren't common, so your S productions no becomes:

S  -> if E then S S'
S' -> else S 
S' -> e   
(e is used here to denote the empty string, which basically means there is no   
 input seen)
查看更多
登录 后发表回答