Unclear how a yacc/bison production spec can cause

2019-09-10 10:55发布

问题:

This is not homework, but it is from a book. I'm given the following grammar:

%{
#include <stdio.h>
#include <ctype.h>

int yylex();
int yyerror();
%}

%%

command : exp '\n' { printf("%d\n", $1); exit(0); }
        | error '\n'
          { 
            yyerrok;
            printf("reenter expression: "); 
          }
          command
        ;

exp : exp '+' term { $$ = $1 + $3; }
    | exp '-' term { $$ = $1 - $3; }
    | term { $$ = $1; }
    ;

term : term '*' factor { $$ = $1 * $3; }
     | factor { $$ = $1; }
     ;

factor : NUMBER { $$ = $1; }
       | '(' exp ')' { $$ = $2; }
       ;

%%

int main() {
  return yyparse();
}

int yylex() {
  int c;

  /* eliminate blanks*/
  while((c = getchar()) == ' ');

  if (isdigit(c)) {
    ungetc(c, stdin);
    scanf("%d\n", &yylval);
    return (NUMBER);
  }

  /* makes the parse stop */
  if (c == '\n') return 0;

  return (c);
}

int yyerror(char * s) {
  fprintf(stderr, "%s\n", s);
  return 0;
} /* allows for printing of an error message */

Here is the task:

The simple error recovery technique suggested for the calculator program is flawed in that it could cause stack overflow after many errors. Rewrite it to remove this problem.

I can't really figure out how a stack overflow can occur. Given the starting production is the only one that has an error token in it, wouldn't yacc/bison pop all the elements on the stack and before restarting?

回答1:

When in doubt, the simplest thing is to use bison.

I modified the program slightly in order to avoid the bugs. First, since the new program relies on seeing '\n' tokens, I removed the line if (c == '\n') return 0; which would suppress sending '\n'. Second, I fixed scanf("%d\n", &yylval); to scanf("%d", &yylval);. There's no reason to swallow the whitespace following the number, particularly if the whitespace following the number is a newline. (However, scanf patterns don't distinguish between different kinds of whitespace, so the pattern "%d\n" has exactly the same semantics as "%d ". Neither of those would be correct.)

Then I added the line yydebug = 1; at the top of main and supplied the -t ("trace") option to bison when I built the calculator. That causes the parser to show its progress in detail as it processes the input.

It helps to get a state table dump in order to see what's going on. You can do that with the -v bison option. I'll leave that for readers, though.

Then I ran the program and deliberately typed an syntax error:

./error
Starting parse
Entering state 0
Reading a token: 2++3

The trace facility has already output two lines, but after I give it some input, the trace comes pouring out.

First, the parser absorbs the NUMBER 2 and the operator +: (Note: nterm below is bison's way of saying "non-terminal", while token is a "terminal"; the stack shows only state numbers.)

Next token is token NUMBER ()
Shifting token NUMBER ()
Entering state 2
Reducing stack by rule 9 (line 25):
   $1 = token NUMBER ()
-> $$ = nterm factor ()
Stack now 0
Entering state 7
Reducing stack by rule 8 (line 22):
   $1 = nterm factor ()
-> $$ = nterm term ()
Stack now 0
Entering state 6
Reading a token: Next token is token '+' ()
Reducing stack by rule 6 (line 18):
   $1 = nterm term ()
-> $$ = nterm exp ()
Stack now 0
Entering state 5
Next token is token '+' ()
Shifting token '+' ()
Entering state 12

So far, so good. State 12 is where we get to after we've seen +; here is its definition:

State 12

    4 exp: exp '+' . term
    7 term: . term '*' factor
    8     | . factor
    9 factor: . NUMBER
   10       | . '(' exp ')'

    NUMBER  shift, and go to state 2
    '('     shift, and go to state 3

    term    go to state 17
    factor  go to state 7

(By default, bison doesn't clutter up the state table with non-core items. I added -r itemset to get the full itemset, but it would have been easy enough to do the closure by hand.)

Since in this state we're looking for the right-hand operand of +, only things which can start an expression are valid: NUMBER and (. But that's not what we've got:

Reading a token: Next token is token '+' ()
syntax error

OK, we're in State 12, and if you look at the above state description, you'll see that error is not in the lookahead set either. So:

Error: popping token '+' ()
Stack now 0 5

That puts us back in State 5, which is where an operator was expected:

State 5

    1 command: exp . '\n'
    4 exp: exp . '+' term
    5    | exp . '-' term

    '\n'  shift, and go to state 11
    '+'   shift, and go to state 12
    '-'   shift, and go to state 13

So that state doesn't have a transition on error either. Onwards.

Error: popping nterm exp ()
Stack now 0

OK, back to the beginning. State 0 does have an error transition:

   error   shift, and go to state 1

So now we can shift the error token and enter state 1, as indicated by the transition table:

Shifting token error ()
Entering state 1

Now we need to synchronize the input by skipping input tokens until we get to a newline token. (Note that bison actually pops and pushes the error token while it's doing this. Try not to let that distract you.)

Next token is token '+' ()
Error: discarding token '+' ()
Error: popping token error ()
Stack now 0
Shifting token error ()
Entering state 1
Reading a token: Next token is token NUMBER ()
Error: discarding token NUMBER ()
Error: popping token error ()
Stack now 0
Shifting token error ()
Entering state 1
Reading a token: Next token is token '\n' ()
Shifting token '\n' ()
Entering state 8

Right, we found the newline. State 5 is command: error '\n' . $@1 command. $@1 is the name of the marker (empty production) which bison inserted in place of the mid-rule action (MRA). State 8 will reduce this marker, causing the MRA to run, which asks me for more input. Note that at this point error recovery is complete. We are now in a perfectly normal state, and the stack reflects the fact that we have, in order, the start (state 0), an error token (state 1) and a newline token (state 8):

Reducing stack by rule 2 (line 13):
-> $$ = nterm $@1 ()
Stack now 0 1 8
Entering state 15
Reading a token: Try again: 

After the MRA is reduced, the corresponding action from State 8 is taken and we proceed to State 15 (to avoid clutter, I left out the non-core items):

State 15

    3 command: error '\n' $@1 . command

    error   shift, and go to state 1
    NUMBER  shift, and go to state 2
    '('     shift, and go to state 3

So now we're ready to parse a brand new command, as expected. But we have not yet reduced the error production; it's still on the stack because it can't be reduced until the command following the dot has been reduced. And we haven't even started on it yet.

But it's important to note that State 15 does have a transition on error, as you can see from the state's goto table. It has that transition because the closure includes the two productions for command:

    1 command: . exp '\n'
    3        | . error '\n' $@1 command

as well as the productions for exp, term and factor, which are also part of the closure.

So what happens if we now enter another error? The stack will be popped back to this point (0 1 8 15), a new error token will be pushed onto the stack (0 1 8 15 1), tokens will be discarded until a newline can be shifted (0 1 8 15 1 8) and a new MRA ($@1, as bison calls it) will be reduced onto the stack (0 1 8 15 1 8 15) at which point we're ready to start parsing yet another attempt.

Hopefully you can see where this is going.

Note that it really is not different from the effect of any other right-recursive production. Had the grammar attempted to accept a number of expressions:

prog: exp '\n'
    | exp '\n' { printf("%d\n", $1); } prog

you would see the same stack build-up, which is why right-recursion is discouraged. (And also because you end up inserting MRAs to avoid seeing the results in reverse order as the stack is reduced down to prog at the end of all input.)

    command  go to state 20
    exp      go to state 5
    term     go to state 6
    factor   go to state 7