I'm currently learning about parsers. I've been watching videos and trying to write code, but I'm starting to have a hard time understanding. I thought maybe understand the motivation for a parser could help to understand how they work and how they should be built.
So the goal of a parser is to take a string of tokens and create a parse tree. I can understand what a parse tree is, but I just don't see how it's used. Ultimately, a compiler uses a parse tree to create machine code, but exactly how does it do this? Can somebody show me an example?
What else is parsing (and parse trees) used for?
Imagine you want to make a language for math expressions. The user might input
(3 + 4) * 36
The compiler would create a parse tree for that input that looks like
*
/ \
+ 36
/ \
3 4
A simple compiler could generate machine code to evaluate this math expression by recursively walking through the tree, emitting the children's instructions, then its own. Aka a post-order traversal. The order of instructions emitted in this case would be:
- Instructions for putting the number 3 on the stack
- Instructions for putting the number 4 on the stack
- Instructions for popping the top 2 elements off the stack, adding them, and putting the result on the stack
- Instructions for putting the number 36 on the stack
- Instructions for popping the top 2 elements on the stack, multiplying them, and putting the result on the stack.
This code compiles the tree in exactly that way. When you run the program it prints the MIPS assembly instructions needed to evaluate the expression.
#include <stdio.h>
enum TYPE {
ADD,
MULTIPLY,
NUMBER
};
struct tree {
enum TYPE type;
int number_val;
struct tree* left;
struct tree* right;
};
void emit(struct tree* node);
void emitNumber(struct tree* node) {
// load the 32-bit number into a register
printf("lui $t0, %d\n", (node->number_val) & 0xFFFF0000);
printf("ori $t0, $t0, %d\n", (node->number_val) & 0x0000FFFF);
// put the number on the stack
puts("addi $sp, $sp, -4");
puts("sw $t0, 0($sp)");
}
void emitAdd(struct tree* node) {
emit(node->left);
emit(node->right);
// pop the left and right args off the stack and put them in registers
puts("lw $t0, 0($sp)");
puts("addi $sp, $sp, +4");
puts("lw $t1, 0($sp)");
puts("addi $sp, $sp, +4");
// add them and put the result on the stack
puts("add $t2, $t0, $t1");
puts("addi $sp, $sp, -4");
puts("sw $t2, 0($sp)");
}
void emitMult(struct tree* node) {
emit(node->left);
emit(node->right);
// pop the left and right args off the stack and put them in registers
puts("lw $t0, 0($sp)");
puts("addi $sp, $sp, +4");
puts("lw $t1, 0($sp)");
puts("addi $sp, $sp, +4");
// multiply them and put the result on the stack
puts("mul $t2, $t0, $t1");
puts("addi $sp, $sp, -4");
puts("sw $t2, 0($sp)");
}
void emit(struct tree* node) {
if (node == NULL) {
return;
}
switch (node->type) {
case NUMBER:
emitNumber(node);
break;
case ADD:
emitAdd(node);
break;
case MULTIPLY:
emitMult(node);
break;
}
}
int main() {
// create an example tree
struct tree three = { NUMBER, 3, NULL, NULL };
struct tree four = { NUMBER, 4, NULL, NULL };
struct tree thirtysix = { NUMBER, 36, NULL, NULL };
struct tree add = { ADD, 0, &three, &four };
struct tree mult = { MULTIPLY, 0, &add, &thirtysix };
emit(&mult);
// put the calculated result in register $t0
puts("lw $t0, 0($sp)");
}
You can test the output on a MIPS simulator like MARS or SPIM. At the end, register $t0
holds the result 252
which is the answer!
To make a compiler for a full fledged language it would need more types of nodes in the tree and more emit functions. You would also need to put some thought into how to save/restore variables on the stack during function calls. You would also want the compiler to work across architectures. There are a few solutions for this... you could emit bytecode that runs on a virtual machine like how Python or Java or C# does. Or you could compile to an intermediate assembly like Clang does with LLVM, that goes through another compilation stage to target the exact architecture.
Hopefully this gives you some sense of how easy it is to traverse a tree to generate actual instructions, and why you would prefer this tree representation over text.