How to get identifiers scoped in a (LA)LR parser

2019-09-02 02:42发布

问题:

A disadvantage of a (LA)LR parser is that a reduce is only processed at the end of a rule. This is a problem in programming languages with scoped variables like javascript.

Example:

var a = 2;
function (a) {
   a = 4;
}

See the above code example. A parser could look like:

program : instruction program {}
        | {}
        ;

instruction : command {}
            | function {}
            ;

command : "var" identifier "=" literal ";" {}
        ;

function : "function" "(" arguments ")" "{" program "}" {/*1*/}
         ;

arguments : identifier {}
          | identifier "," arguments {}
          | {}
          ;

Now it is clear that each time an identifer is consumed by the parser. One can register the identifier in a register. The problem however is that a function (line /*1*/) is only considered at the end of the function. Instructions using the identifier (like a = 4;) in a function thus cannot be binded to the local/global identifier at parser time, since it is at that time unknown.

What are good practices to tackle this problem, what functionality does C# (standard library) provide to handle such situations?

回答1:

Disclaimer

The grammar in your question is in yacc format, so I'm assuming you are planning to use a yacc-compatible parser generator. Consequently, I have no idea what you expect from the C# standard library. This answer applies in general terms to yacc, bison and some other parser generators which use the same input format. For jison, see the note below with respect to mid-rule actions.

Standard Yacc solution

yacc and its derivatives allow you to write rules with mid-rule actions; these actions are executed before any reductions in the rest of the rule. That makes it quite easy to insert set-up and tear-down actions. For example:

function: "function"    { start_scope(); }
          '(' arguments ')'
          '{' body '}'  { end_scope(); /* ... */ }
        ;

Mid-rule actions do not add any facility to a context free grammars, since they can be written in another way (see below), but they are sometimes easier to read and write. The mid-rule action has a value like any other component of a production, and it can be referenced by a later action. It might, for example, be useful to do something like this:

function: "function"    { $$ = new_scope(); }
          '(' arguments ')'
          '{' body '}'  { close_scope($2); /* ... */ }
        ;

although even there it will be necessary to use persistent state to record the current scope if you want name lookup to be applied to the current scope during the initial parse.

Furthermore, it is possible that the addition of a mid-rule action to an (LA)LR(1) grammar will remove the (LA)LR(1) property. There is an example in the bison manual, which not entirely coincidentally has to do precisely with the question of scoping local variables.

Scoping rules

Tempting though this strategy is, there are many languages for which it will not help as much as you might expect. In javascript, for example, it is entirely legal to refer to a local variable before it is defined as such. Consider the following, for example:

a=42;
function geta() {
  var rv = a;
  var a = 43;
  return rv;
}
geta();

The result of geta is undefined because when rv is initialized from a, that particular a, which is in the scope of geta, has not yet been assigned a value. If the next line were changed to var b = 43; then geta would return 42 because now the a is a reference to the variable in the enclosing scope. Python scoping is roughly the same (although the languages differ in how local functions are scoped), but it is far from universal; there are probably more languages whose scoping rules are like Lua, which can be scoped left to right, in which the first use of a in geta refers to the outer a, while the scope of the inner a starts at its declaration and extends to the end of the block. (Of course, Lua's syntax is slightly different, so you can't just feed it that example.)

C and most C derivatives are based on a "declare-before-use" rule, but C++ does not require names used in the bodies of inline class member functions to have been declared before the function's definition unless the name is a template or typedef. So in C++ the syntactic part of name resolution can be done left-to-right and you can decide whether (foo)*(bar) is a cast or a product, but you still need to revisit the expression in order to do name resolution.

So it is often better to attach names to scopes in a subsequent walk of the AST, in which case it no longer matters that the scope isn't attached to a function until the function production is reduced.

Alternatives for parser generators without mid-rule actions

Not all LALR(1) parser generators allow mid-rule actions. Neither jison -- which otherwise uses a very similar grammar format -- nor lemon implement the feature, for example. (There is a fairly long-outstanding jison feature request.) But it doesn't matter so much, because MRAs are just syntactic sugar. Sometimes, they can be implemented with a uniquely-named production with an empty right-hand side. In other words, the rule:

X: a b c { /* mid-rule action */ } d e f { /* final action */ };

can be transformed into something like:

@32: /* empty */ { /* mid-rule action */ };
X: a b c @32 d e f { /* final action */ };

in which the mid-rule action will be executed when @32 is reduced, i.e. before the reductions of d, e and f.

Much of the time, the MRA needs to be able to reference the semantic values of the symbols to the left of it in the right-hand side, so it would be more accurate to consider the MRA to be syntactic sugar for a prefix of the rhs:

X: @32 d e f { /* final action (with semantic values renumbered) */ };
@32: a b c   { /* mid-rule action */ }

The actual bison implementation is closer to the first suggestion above, because bison knows that when @32 is reduced, the top three stack elements will always be a, b and c. That works because the created non-terminal only appears in one place in the grammar. (Bison gives you access to this stack-probing feature, but please don't use it unless you really know what you're doing.) So you're more likely to use the prefix solution if you need the equivalent to an MRA in jison or lemon.

You might still run into problems, because the final action might refer to a semantic value which has now been subsumed by the prefix rule. To deal with such a case, you might need to pass one or more semantic rules through the prefix rule by making use of its semantic value. Fortunately, these cases are rare.