Decompose expression in base operation: ANTLR + St

I am trying to write a translator for a Java like language to multiple languages.

Right now I am facing 2 problems:

First is to decompose complex expression in a sequence of basic operation and then translating them to destination language.

For example my starting language can be:

var a = (ln(b) + avg(c))*2

I'd like to traslate it like:

var x1 = log_N(b);
var x2 = average(c);
var x3 = sum(x1, x2);
var a = multiply(x3, 2);

I think i have to use a Tree parser, nut I am not sure how to integrate it with StringTemplate. Moreover i am adding extra variables like x1, x2, and x3 and i don't know how to handle this situation.

The second problem is that one of my destination language is a plsql like language. In this case a need to ensure that all output variables become cursors and pass them to the functions.

For example the expression:

var a = (ln(b) + avg(c))*2

Should be translated in this way:

log_N(x1, b);
average(x2, c);
sum(x3, x1, x2);
multiply(a, x3, 2);

where x1, x2, x3 and a will become output cursors.

Can anyone help me finding the right solution?

Thanks

标签： translation antlr grammar stringtemplate

1条回答

beautiful°

2楼-- · 2019-07-01 15:58

I think i have to use a Tree parser, but I am not sure how to integrate it with StringTemplate.

Integrating StringTemplate into a tree parser is basically the same as integrating it into a token parser: define the output option as template, then write the rule productions accordingly.

Below is a small tree grammar that uses templates for output. Note that the only meaningful difference between this grammar and the one I described in a previous answer is that this one operates on tree nodes rather than tokens. The templates work the same.

tree grammar AstToTemplateParser;

options { 
    output = template;
    tokenVocab = JavaLikeToAst;
    ASTLabelType = CommonTree;
}


program
    : ^(PROGRAM decls+=decl+) -> write(text={$decls})
    ;

decl
    : ^(DECL ID ^(op args+=arg*)) -> assign(name={$ID.text}, op={$op.st}, args={$args})
    | ^(DECL ID ^(CALL method args+=arg*)) -> assign(name={$ID.text}, op={$method.st}, args={$args})
    ;

arg 
    : ID -> write(text={$ID.text})
    | INT -> write(text={$INT.text})
    ;

method
    : ID -> op(name={$ID.text})
    ;

op  : STAR  -> op(name={$STAR.text})
    | DIV   -> op(name={$DIV.text})
    | PLUS  -> op(name={$PLUS.text})
    | MINUS -> op(name={$MINUS.text})
    ;

Moreover i am adding extra variables like x1, x2, and x3 and i don't know how to handle this situation.

That's the kicker. The easiest way that I know of (which isn't terribly easy) is to first generate the baseline AST using the token parser then use a tree parser to flatten the AST into a list of declarations, each representing expressions from the original input -- x1, x2, and x3 in your question.

The intermediate stages look like this:

Original Input

var a = (ln(b) + avg(c))*2

Baseline AST

baseline AST

Flattened AST

flattened AST

Standard Template Applied

 var var0 = log_N(b);
 var var1 = average(c); 
 var var2 = add(var0, var1);
 var a = multiply(var2, 2);

PLSQL Template Applied

  log_N(var0, b);
  average(var1, c);
  add(var2, var0, var1);
  multiply(a, var2, 2);

Here's a token parser grammar that simply generates a baseline AST that resembles the one in your question. There's nothing really noteworthy here, it just produces a typical AST.

grammar JavaLikeToAst;

options { 
    output = AST;
}

tokens { 
    PROGRAM; DECL; CALL; 
}

compilationUnit : statement* EOF -> ^(PROGRAM statement*);
statement       : decl;
decl            : VAR ID EQ expr -> ^(DECL ID expr);
expr            : add_expr;
add_expr        : mul_expr ((PLUS|MINUS)^ mul_expr)*;
mul_expr        : call_expr ((STAR|DIV)^ call_expr)*;
call_expr       : ID LPAR arglist? RPAR -> ^(CALL ID arglist?)
                | primary_expr;
arglist         : expr (COMMA! expr)*;
primary_expr    : ID | INT | LPAR! expr RPAR!; 

VAR     : 'var';
ID      : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
INT     : ('0'..'9')+;
COMMA   : ',';
SEMI    : ';';
LCUR    : '{';
RCUR    : '}';
LPAR    : '(';
RPAR    : ')';
EQ      : '=';
PLUS    : '+';
MINUS   : '-';
STAR    : '*';
DIV     : '/';
WS      : (' '|'\t'|'\f'|'\r'|'\n'){skip();};

Here is where it gets ugly (at least the way I choose to do it ;)). The tree grammar below transforms a baseline AST produced from above into a list of declarations that correspond to the expressions of the AST -- the flattened AST. Below this, I'll step through the fun bits of the grammar.

tree grammar AstToFlatAstParser;

options { 
    output = AST;
    tokenVocab = JavaLikeToAst;
    ASTLabelType = CommonTree;
    filter = true;
}

@header { 
    import java.util.HashMap;
}

@members {
    private int currentId = 0;
    private HashMap<Integer, Object> exprs = new HashMap<Integer, Object>();
    private boolean newDecls = false;

    private int nextId() { 
        return currentId++;
    }

    private Object generateId(int id) { 
        return adaptor.create(ID, "var" + id);
    }  

    private void saveExpr(int id, Object expr){
        newDecls = true;
        exprs.put(id, expr);
    }

    private Object buildNewDecls() {
        Object newDecls = adaptor.nil();

        for (int i = 0; i < currentId; ++i){
            if (!exprs.containsKey(i)){
                continue; //This id was generated but not used.
            }

            Object expr = exprs.get(i);
            Object decl = adaptor.create(DECL, tokenNames[DECL]);
            adaptor.addChild(decl, adaptor.create(ID, "var" + i));
            adaptor.addChild(decl, expr);
            adaptor.addChild(newDecls, decl);
        }

        exprs.clear();

        return newDecls;
    }
}

bottomup
    : exit_program
    | exit_op
    ;

exit_op
    @init {
        int myId = nextId();
    }
    : ^(binary_op reduced reduced)
        {$start.parent != null && $start.parent.getType() != DECL}? 
        {saveExpr(myId, $start);} 
        -> {generateId(myId)}
    | ^(CALL ID .*) 
        {$start.parent != null && $start.parent.getType() != DECL}? 
        {saveExpr(myId, $start);} 
        -> {generateId(myId)}
    ;   

binary_op       : STAR | DIV | PLUS | MINUS;

reduced         : ID | INT; 

exit_program
    //Only rebuild PROGRAM if a new declaration is going to be built, that is, when "newDecls" is true.
    //Otherwise PROGRAM is considered changed when it isn't and the processing never ends.
    : {newDecls}? ^(PROGRAM old+=.*) {newDecls = false;} 
        -> ^(PROGRAM {buildNewDecls()} $old*)
    ;

First, notice that the grammar is mostly Java code. There are only five parser rules, and most of them are simple. This is a filter tree grammar, so rules bottomup and topdown are the entry points. Only bottomup is necessary in this case, so topdown isn't specified. Rule bottomup is repeated until the output tree is unchanged, which to us means when there are no more declarations to produce and the tree is fully flattened.

Second, notice that rule exit_program is where the new declarations are being written out to the AST. I'm using a semantic predicate ({newDecls}?) to ensure that PROGRAM is only modified when new declarations are added. Remember how I said bottomup is called until no more changes are made? Without this semantic predicate, exit_program will always modify PROGRAM and the tree parsing will never stop processing bottomup. This is a crude work-around for this special case but it works. The new declarations are inserted at the beginning of PROGRAM to ensure that they appear before they are referenced. It's no good defining x1 ten lines after it's expected.

Third, notice that rule exit_op replaces expressions (like ln(b)) with declarations (like var0). An expression is replaced if one of the following is true:

The expression is a binary operation whose operands are both "reduced" (that is, they're both integers or variable ids) and is not the child of a DECL node. var a = 1 + 2 is not changed because 1 + 2 is the child of a declaration. var b = a + (2 + c) is changed because (2 + c) has two "reduced" operands and is not the child of a DECL node (it's the child of the + in a + ...).
The expression is a CALL that is not the child of a DECL node. var a = ln(b) is untouched, but var a = ln(b) + 3 is changed because ln(b) is a child of +.

An expression is stored in exprs before it's replaced by an id. It's reconstituted in the exit_program rule when the rule calls buildNewDecls. buildNewDecls just uses the parser's built-in TreeAdaptor member (named adaptor) to generate the DECL nodes that appear in the flattened AST. The Javadoc for the adaptor methods does an adequate job explaining what the calls do, so I won't go into the details.

Caveat: The parsers produced by the grammars above work fine for the very limited case that you've presented. I don't know what errors they will produce when applied to any broader scenario.

The second problem is that one of my destination language is a plsql like language. In this case a need to ensure that all output variables become cursors and pass them to the functions...

That would be something that the templates could manage for you once your AST is just a flat list of declarations, as shown above.

You'll pass the flattened AST to a template-based tree parser, like the one up top, to produce the different text versions like those you've listed. In this case, a template would accept all the parts of a declaration – the variable name, the operation/method name, and the operands/arguments – and produce text like either variable = method(arg0, arg1) or method(variable, arg0, arg1), depending on the template used. The key is to ensure that the input is flat and that the template receives everything related to the declaration.

Here is a test application that ties it all together.

JavaLikeToAstTest.java

public class JavaLikeToAstTest {

    public static void main(String[] args) throws Exception {

        final String code = "var a = (ln(b) + avg(c))*2";

        CharStream input = new ANTLRStringStream(code);
        JavaLikeToAstLexer lexer = new JavaLikeToAstLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        JavaLikeToAstParser parser = new JavaLikeToAstParser(tokens);

        JavaLikeToAstParser.compilationUnit_return result = parser
                .compilationUnit();

        if (lexer.getNumberOfSyntaxErrors() > 0 || parser.getNumberOfSyntaxErrors() > 0) {
            throw new Exception("Syntax Errors encountered!");
        }

        CommonTree tree = (CommonTree) result.tree;

        System.out.printf("Baseline AST: %s%n%n", tree.toStringTree());

        tree = flatten(tree);

        System.out.printf("Flattened AST: %s%n%n", tree.toStringTree());

        translate(tree, "AstToPlsql.stg");
        translate(tree, "AstToGlobal.stg");
    }

    private static CommonTree flatten(CommonTree tree) {
        AstToFlatAstParser parser = new AstToFlatAstParser(
                new CommonTreeNodeStream(tree));
        return (CommonTree) parser.downup(tree, true);
    }

    private static void translate(CommonTree tree, String templateResourceName)
            throws Exception {
        AstToTemplateParser parser = new AstToTemplateParser(
                new CommonTreeNodeStream(tree));
        InputStream stream = JavaLikeToTemplateTest.class
                .getResourceAsStream(templateResourceName);
        Reader reader = new InputStreamReader(stream);
        parser.setTemplateLib(new StringTemplateGroup(reader));
        reader.close();
        stream.close();

        System.out.printf("Result for %s%n%n%s%n%n", templateResourceName,
                parser.program().st.toString());

    }

Here are two simple StringTemplate group files to handle the translation process.

AstToGlobal.stg

group AstToGlobal;

methods ::= ["*":"multiply", "/":"divide", "+":"add", "-":"subtract", "avg":"average", "ln":"log_N", default:key]

assign(name, op, args) ::= <<var <name> = <op>(<args;separator=", ">) >>

op(name) ::= "<methods.(name)>"

write(text) ::= << <text;separator="\n"> >>

AstToPlsql.stg

group AstToPlsql;

methods ::= ["*":"multiply", "/":"divide", "+":"add", "-":"subtract", "avg":"average", "ln":"log_N", default:key]

assign(name, op, args) ::=<< <op>(<name>, <args;separator=", ">) >>

op(name) ::= "<methods.(name)>"

write(text) ::= << <text;separator="\n"> >>

The application produces the following output:

Baseline AST: (PROGRAM (DECL a (* (+ (CALL ln b) (CALL avg c)) 2)))

(CALL ln b) -> var0
(CALL avg c) -> var1
(+ var0 var1) -> var2
(PROGRAM (DECL a (* var2 2))) -> (PROGRAM (DECL var0 (CALL ln b)) (DECL var1 (CALL avg c)) (DECL var2 (+ var0 var1)) (DECL a (* var2 2)))
Flattened AST: (PROGRAM (DECL var0 (CALL ln b)) (DECL var1 (CALL avg c)) (DECL var2 (+ var0 var1)) (DECL a (* var2 2)))

Result for AstToPlsql.stg

  log_N(var0, b ) 
  average(var1, c ) 
  add(var2, var0 , var1 ) 
  multiply(a, var2 , 2 )  

Result for AstToGlobal.stg

 var var0 = log_N(b ) 
 var var1 = average(c ) 
 var var2 = add(var0 , var1 ) 
 var a = multiply(var2 , 2 )

There is no code in AstToTemplate.g or the templates to handle simple assignments like var a = 3, but I think it will be easy enough to add the code to handle it using the op/method assignment as a guide.

0人赞添加讨论(0) 举报