Symbol Table Population after parsing; Compiler bu

After creating the parse tree, i have to populate symbol table now.

I have to store information like

Type, Scope, Offset etc for the identifiers.

Now how do i know the type, scope of the identifiers , since all i know is the lexeme value and line number for that particular ID (after lexical analysis).

How do i got about the whole thing. thanks.

回答1:

Now how do i know the type, scope of the identifiers , since all i know is the lexeme value and line number for that particular ID (after lexical analysis).

As EJP mentioned, you need to step through the parse tree. Your tree should have been created so that you can do an in-order traversal, visiting each node in the same order in which source code statements and expressions are evaluated. Your tree nodes should also correspond with a specific language construct, e.g. WhileStmtNode, MethodDeclNode, etc.

Suppose I am building the symbol table, recursively stepping through the tree, and I've just entered a method body node. I might have something like the following:

public void doAction(MethodBodyNode methodBody) {
    currScope = 2;
    methodBody.getExpr().applyAction(this);
    currScope = 2;
}

I keep a global variable to manage the scope. Every time I enter a block where the scope changes, I'd increment currScope. Similarly, I'd maintain currClass and currMethod variables to store with the symbol name, type, offset, etc. for later phases.

Update:

Now say, i am traversing the tree, everytime i come across an ID i would have to enter the value to symbol table along with the type, scope and others, say for scope i check if i come across '{' or function name, but how do i know what type of ID is this.

Each tree node should contain all the necessary information for the entire construct. If you're using a parser generator, like CUP or Bison, you can specify how to build the tree in the grammar actions. E.g.

variableDeclaration::= identifier:i identifier:i2 SEMICOLON {: RESULT = new VarDeclNode(i, i2, null); :};
identifier::= ID:i {: RESULT = new IdNode(i.getLineNum(), i.getCharNum(), i.getStringValue()); :};

Those productions would match Foo f; and append a variable declaration node to the tree. That node encapsulates two identifier nodes that contain the line number, character number, and string value of the lexeme. The first identifier node is the type and the second is the variable name. ID is a terminal symbol returned by the lexer upon matching an identifier. I'm assuming you are familiar with this to some extent.

public class VarDeclNode extends StmtNode {

    private IdNode id;
    private IdNode type;
    private ExprNode expr;

    public VarDeclNode(IdNode id, IdNode type, ExprNode expr) {
        super();
        this.id = id;
        this.type = type;
        this.expr = expr;
    }

}

When you've got a syntax tree with nodes like this, you've got all the information you need.

2nd Update:

It doesn't matter whether you're using a custom parser or a generated one, there is one distinct point where you add a node into the tree upon matching a production. And it doesn't matter what language you're using. C structs will do just fine.

if it is a non terminal has the info as Nonterminals name, and if it is a terminal i.e. a token, then the info in token i.e. lexeme value, token name and line number are stored

You must have specialized nodes in the tree, e.g. ClassNode, TypeNode, MethodDeclNode, IfStmtNode, ExprNode. You can't just store one type of node and put non-terminals and terminals in it. A non-terminal is represented as a tree node, there's no other information to store about it beside the parts that compose it, which are generally non-terminals themselves. You wouldn't store any token information. There are only but a few instances where you'd actually store a lexeme's string value: for an identifier and for a string/boolean/integer literal.

Have a look at this example. During the first reduction when S is reduced to (S + F), you append a ParenExprNode to the tree root. You also append a AddExprNode as child of the ParenExprNode. That logic must be hard-coded into your parser when applying a reduction by rule 2 of the grammar.

The tree:

    ExprNode (root)
       |
  ParenExprNode
       |
   AddExprNode
   /         \
ExprNode   ExprNode

The code:

struct ExprNode { void* exprNode; };
struct ParenExprNode { void* exprNode; };
struct AddExprNode { void* op1, * op2; };
struct IdNode { char* val; int line; int charNum; };
struct IntLiteralNode { int val; int line; int charNum; };

void reduce_rule_2(ExprNode* expr) {

    //remove stack symbols

    //create nodes
    struct ParenExprNode* parenExpr = malloc(sizeof(struct ParenExprNode));
    struct AddExprNode* addExpr = malloc(sizeof(struct AddExprNode));
    addExpr->op1 = malloc(sizeof(struct ExprNode));
    addExpr->op2 = malloc(sizeof(struct ExprNode));

    //link them
    parenExpr->exprNode = (void*)addExpr;
    expr->exprNode = (void*)parenExpr;
}

In the next step, the left parenthesis is removed from the input. Afterward, S is on top of the stack and it is reduced to F by rule 1. Since F is the non-terminal for an identifier, it's represented by IdNode.

The tree:

    ExprNode
       |
  ParenExprNode
       |
   AddExprNode
   /         \
ExprNode   ExprNode
   |
 IdNode

The code:

reduce_rule_2(addExpr->op1);

void reduce_rule_1(ExprNode* expr) {
    //reduce stack symbols
    struct IdNode* id = malloc(sizeof(struct IdNode));
    id->val = parser_matched_text();
    id->lineNum = parser_line_num();
    id->charNum = parser_char_num();
    expr->exprNode = (void*)id;
}

And so on...

回答2:

all i know is the lexeme value and line number for that particular ID

That's not true. You know where it is declared in the parse tree, which tells you everything you need. You do this step by processing the parse tree.