Compiling an AST to Assembly

Published 2020-06-04 11:07

Question:

I have an abstract syntax tree that I need to convert to assembly for a virtual machine. I don't know how best to do this so I started using a chain of string templates. Pseudo-code example of what I mean, say a simplistic if-statement with one condition needs to be compiled:

std::string compile_if(Node* n) {
    std::string str;

    int curLabel = nLabels++;

    // condition code; falls through to .true on success,
    // jumps to .false otherwise
    str += compile_comparison(n->getChild(0));

    str += ".true" + std::to_string(curLabel) + ":\n";
    str += compile_block(n->getChild(1));

    str += ".false" + std::to_string(curLabel) + ":\n";

    return str;
}

Where each compile_* generates an assembly string based on the current/next AST nodes. Then the final string is run through an assembler. This seems sloppy and hard to maintain; surely this isn't what most compilers do. Is this a bad idea, should I change it? How do most other compilers generate virtual assembly code / machine code?

Answer 1:

Disclaimer: I only have experience with X86 machine code. Other instruction sets may have, for example, different addressing capabilities, so parts of my advice might not apply. I'm sorry that I don't have time to research instruction sets at the moment.


Well firstly, most compilers don't generate assembly as text, because it's kinda inefficient to serialise the code into assembly only to have it parsed straight away by the assembler, as you have probably realised. It is reasonable to have separate compilation and assembly phases, but not essential.

In the compilation phase, the two strategies I would consider are:

  • (a) generate the assembly as a tree / array of instruction objects which can symbolically refer to one another. In the assembly phase these need to be serialised into bytecode/machinecode. I'd recommend this method, even if it makes your compiler's architecture a little more complex.

  • (b) generate the assembly as machinecode/bytecode into a buffer, with some helper functions; in this case you don't really have a separate assembly phase. I've personally tried this method; within the scope of a single function it's not bad, but it causes some extra difficulty because you don't know how large a function will be until it's assembled.

I would guess that (a) is the approach used by optimising compilers like GCC, while (b) is the approach used by high-speed compilers like TCC.


Let's again consider the if example by examining the code that an existing compiler generates for a simple if/else branch:
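The disassembly itself isn't reproduced here, but for a branch like `if (x) { ... } else { ... }` the generated code typically has this shape (an illustrative sketch in Intel syntax, not the output of any particular compiler):

```
    test  eax, eax      ; evaluate the condition
    jz    .not_taken    ; overlapping jump #1: skips the 'taken' block
    ; ... 'taken' block ...
    jmp   .end          ; overlapping jump #2: skips the 'not-taken' block
.not_taken:
    ; ... 'not-taken' block ...
.end:
```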

Note the overlapping jumps in the disassembly - one that skips the 'taken' block and one that skips the 'not-taken' block.

These are relative jumps, so in order to assemble them we need to know how many bytes of instructions are between the jump instruction and the destination.

Here's an example of what the compilation function might look like using strategy (a):

Instruction[] compile_if(IfNode n) {
    Instruction[] code;

    code ~= compile_condition(n.condition);

    JumpInstruction skip_taken = new JumpInstruction(`jz`);
    code ~= skip_taken;

    code ~= compile_block(n.taken_block);

    JumpInstruction skip_nottaken = new JumpInstruction(`jmp`);
    code ~= skip_nottaken;

    Instruction[] nottaken_code = compile_block(n.nottaken_block);
    skip_taken.destination = nottaken_code[0];
    code ~= nottaken_code;

    Instruction end = new NopInstruction();
    skip_nottaken.destination = end;
    code ~= end;

    return code;
}

This should be pretty self-explanatory.

Note how instructions refer to one another symbolically (skip_taken.destination = nottaken_code[0]), rather than by byte-offsets like in serialised machinecode. We leave those offset calculations for the assembler.

Also note how we set the destinations of the JumpInstructions only once they become available.

The NopInstruction at the end is just to give the skip_nottaken jump something to refer to.
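If D's `~=` append syntax is unfamiliar, the same object model in C++ might look roughly like this (every type and helper name here is invented for the sketch, not taken from any library):

```cpp
#include <memory>
#include <string>
#include <vector>

// Minimal instruction objects for strategy (a).
struct Instruction {
    virtual ~Instruction() = default;
};

struct JumpInstruction : Instruction {
    std::string mnemonic;                // "jz", "jmp", ...
    Instruction* destination = nullptr;  // symbolic target; the assembler
                                         // later turns this into an offset
    explicit JumpInstruction(std::string m) : mnemonic(std::move(m)) {}
};

struct NopInstruction : Instruction {};

using Code = std::vector<std::unique_ptr<Instruction>>;

// Append the instructions of `src` to `dst` (the C++ spelling of D's
// `code ~= ...`), returning the first appended instruction so a jump
// can take it as a symbolic destination.
Instruction* append(Code& dst, Code src) {
    Instruction* first = src.empty() ? nullptr : src.front().get();
    for (auto& ins : src)
        dst.push_back(std::move(ins));
    return first;
}
```

The `destination` pointer is non-owning: the `Code` vector owns every instruction, and jumps merely refer into it.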

Now, how do we actually assemble these jumps into real machinecode/bytecode? Here's one possibility (a very basic example):

byte[2] assemble_jz(Instruction[] code, int idx) {
    // assemble the jz instruction at code[idx]
    // (assumes a forward jump with a destination in 8-bit range)

    JumpInstruction jump = cast(JumpInstruction) code[idx];
    ++idx;

    // sum the sizes of the instructions between the jump and its destination
    byte jump_offset = 0;
    while (code[idx] != jump.destination) {
        jump_offset += size_of_instruction(code[idx]);
        ++idx;
    }

    byte[2] machinecode = [
        0x74, // jz short
        jump_offset
    ];
    return machinecode;
}

Because the assembler has access to all the instruction objects, it can calculate the actual offsets for relative jumps by scanning ahead until it finds the destination instruction.
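The scan-ahead loop only handles forward jumps. A common generalisation, sketched below in C++ with invented types, is a two-pass assembler: pass one assigns every instruction a byte address, after which any relative offset, backward as well as forward, falls out as the destination address minus the address just past the jump:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Pared-down instruction record for the sketch: its encoded size in
// bytes, and (for jumps) a pointer to the destination instruction.
struct Ins {
    int size;
    const Ins* dest;  // nullptr for non-jumps
};

// Pass 1: walk the instruction list once, recording each instruction's
// byte address within the function.
std::unordered_map<const Ins*, size_t>
assign_addresses(const std::vector<Ins>& code) {
    std::unordered_map<const Ins*, size_t> addr;
    size_t pc = 0;
    for (const Ins& i : code) {
        addr[&i] = pc;
        pc += i.size;
    }
    return addr;
}

// Pass 2 (per jump): the encoded relative offset is the destination's
// address minus the address of the byte after the jump instruction.
long rel_offset(const std::unordered_map<const Ins*, size_t>& addr,
                const Ins& jump) {
    return static_cast<long>(addr.at(jump.dest)) -
           static_cast<long>(addr.at(&jump) + jump.size);
}
```

This sketch assumes fixed instruction sizes; on x86, where a jump can be encoded with an 8-bit or 32-bit offset, real assemblers iterate (or pessimise to the wide form) because choosing the encoding changes the addresses.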


I hope this brief introduction helps you get started with designing your own compiler backend. Obviously I'm not suggesting that you write your compiler exactly like my example, but it should give you some ideas of how to approach the general problem of compiling and assembling non-linear instruction blocks.

You might also want to take a look at some existing assembler APIs such as https://github.com/asmjit/asmjit .

Good luck.