Implementing Closures in a Compiler

2019-04-26 15:24发布

I am attempting to design a basic compiler to pseudo-assembly code. However, I cannot figure out how to implement closures. It seems I would need to associate specific register values with each "subroutine". I've considered use of stacks, but once again it seems insufficient. It seems like nothing short of an associative array would work, but how could that, or something similar, be done in assembly?

The example I have chosen to attempt to represent is the following, communicated as CoffeeScript for conciseness.

((x) -> (y) -> x(y))((x) -> x)(2)

Here is the general structure I have been trying. This is a sample of the pseudo-assembly to which I am compiling.

'((label lam1)   ;; (x) -> x
  (cp resp arg)
  (ret)

  (label lam2)   ;; (y) -> x(y)
  (jmp x)

  (label lam3)   ;; (x) -> [(y) -> ...]
  (cp x arg)     ;; this is the assignment intended for a closure
  (cp resp lam2) ;; this is a returned lambda, needing the closure
  (ret)

  (label main)
  (cp arg lam1)
  (call lam3)
  (set arg 2)
  (call resp))) 

This works; however, the value is simply set under the name x and then a lambda returned, the x value could be easily polluted prior to execution of the lambda.

The description of implementation in Structure and Interpretation of Computer Programs was the following, which does not seem doable to me in assembly. I don't know what other tactics they could be using.

The procedure object will be constructed at run time by combining the current environment (the environment at the point of definition) with the entry point to the compiled procedure (a newly generated label).

In summary, How can register values be associated with "subroutines"? Could stacks be sufficient?

1条回答
乱世女痞
2楼-- · 2019-04-26 16:12

Stacks cannot be sufficient... consider a simpler case where they do

function bar(f) {
    alert(f());
}

function foo(x) {
    bar(function(){ return x; });
}

foo(42);

In the above case it would be theoretically possible for the x in the closure to live in the stack frame of foo because the closure is not going to outlive its creator foo. However with a small change:

function bar(f) {
    to_call_later.push(f);
}

the closure will be stored away and will be potentially called when foo already terminated and the stack space for its activation record has been reclaimed. Clearly x cannot be in that stack area because it must survive.

Therefore there are two problems:

  1. a closure must have some storage (environment). This is obvious when you think that calling foo twice passing two different values should create two independent storages for x. If the closure was just the code then this is not possible unless different code was going to be generated each time you call foo.

  2. this storage must live at least as long as the closure itself, not only as who creates the closure.

Note also that if you want to have read/write closed-over variables you need an extra level of indirection, for example:

function bar(f) {
    alert(f());
}

function foo(x) {
    var c1 = function() { return ++x; };
    var c2 = function() { return x *= 2; };
    bar(c1);
    bar(c2);
}

foo(42);  // displays 42+1=43 and 43*2=86 (not 42*2=84!)

in other words you can have several different closures sharing the same environment.

So x cannot be in the stack of foo activation record and it cannot be in the closure object itself. The closure object must have a pointer to where x is living.

A possible solution to implement this on say x86 is:

  • Use a garbage collected or reference-counted memory management system. Stacks are by far insufficient to handle closures.

  • Each closure is an object with two fields: a pointer to code and an array of pointers to closed-over variables (the "environment").

  • When executing code you have a stack esp and e.g. esi is pointing to the closure object itself (so (esi) is the address of the code, (esi+4) is the address of first closed-over variable, (esi+8) is the address of second closed-over variable and so on).

  • Each variable is an independent heap-allocated object that can survive as long as there are still closures pointing to it.

This is of course a very crude approach. For example SBCL is much smarter and variables that are not captured are allocated on stack and/or registers only. This requires doing an analysis of how a closure is used.

Edit

Supposing you're only considering a purely functional setting (in other words the return value of a function/closure depends only on the passed parameter and the closure state cannot mutate) then things can be simplified a little.

What you can do is making the closure object containing the captured values instead of the captured variables and by making at the same time the closure itself a copyable object then just a stack can in theory be used (except that there is the problem that a closure can vary in size depending on how much state needs to capture), so it's not easy at least for me to imagine a reasonable stack-only based protocol for parameter passing and value returning in this case.

Removing the variable size problem by making the closure a fixed-size object you can see how this C program can implement closures using only stack (note that there are no malloc calls)

#include <stdio.h>

typedef struct TClosure {
    int (*code)(struct TClosure *env, int);
    int state;
} Closure;

int call(Closure *c, int x) {
    return c->code(c, x);
}

int adder_code(Closure *env, int x) {
    return env->state + x;
}

int multiplier_code(Closure *env, int x) {
    return env->state * x;
}

Closure make_closure(int op, int k) {
    Closure c;
    c.state = k;
    c.code = (op == '+' ? adder_code : multiplier_code);
    return c;
}

int main(int argc, const char *argv[]) {
    Closure c1 = make_closure('+', 10);
    Closure c2 = make_closure('*', 3);
    printf("c1(3) = %i, c2(3) = %i\n",
           call(&c1, 3), call(&c2, 3));
    return 0;
}

Closure structs can be passed, returned and stored on stack because the environment is read-only so you don't have the lifetime problem because immutable data can be copied without affecting semantic.

A C compiler could use such an approach to create closures that can only capture variables by value, and indeed is what C++11 lambda provide (you can capture also by reference, but it's up to the programmer to ensure that the lifetime of captured variables lasts enough).

查看更多
登录 后发表回答