How to merge two ASTs?

2019-04-29 15:12发布

I'm trying to implement a tool for merging different versions of some source code. Given two versions of the same source code, the idea would be to parse them, generate the respective Abstract Source Trees (AST), and finally merge them into a single output source keeping grammatical consistency - the lexer and parser are those of question ANTLR: How to skip multiline comments.

I know there is class ParserRuleReturnScope that helps... but getStop() and getStart() always return null :-(

Here is a snippet that illustrates how I modified my perser to get rules printed:

parser grammar CodeTableParser;

options {
    tokenVocab = CodeTableLexer;
    backtrack = true;
    output = AST;
}

@header {
    package ch.bsource.ice.parsers;
}

@members {
    private void log(ParserRuleReturnScope rule) {
        System.out.println("Rule: " + rule.getClass().getName());
        System.out.println("    getStart(): " + rule.getStart());
        System.out.println("    getStop(): " + rule.getStop());
        System.out.println("    getTree(): " + rule.getTree());
    }
}

parse
    : codeTabHeader codeTable endCodeTable eof { log(retval); }
    ;

codeTabHeader
    : comment CodeTabHeader^ { log(retval); }
    ;

...

2条回答
地球回转人心会变
2楼-- · 2019-04-29 15:48

Assuming you have the ASTs (often difficult to get in the first place, parsing real languages is often harder than it looks), you first have to determine what they have in common, and build a mapping collecting that information. That's not as easy as it looks; do you count a block of code that has moved, but is the same exact subtree, as "common"? What about two subtrees that are the same except for consistent renaming of an identifier? What about changed comments? (most ASTs lose the comments; most programmers will think this is a really bad idea).

You can build a variation of the "Longest Common Substring" algorithm to compare trees. I've used that in tools that I have built.

Finally, after you've merged the trees, now you need to regenerate the text, ideally preserving most of the layout of the original code. (Programmers hate when you change the layout they so loving produced). So your ASTs need to capture position information, and your regeneration has to honor that where it can.

查看更多
smile是对你的礼貌
3楼-- · 2019-04-29 15:53

The call to log(retval) in your parser code looks like it's going to happen at the end of the rule, but it's not. You'll want to move the call into an @after block.

I changed log to spit out a message as well as the scope information and added calls to it to my own grammar like so:

script    
    @init {log("@init", retval);}
    @after {log("@after", retval);}
    : statement* EOF  {log("after last rule reference", retval);} 
        -> ^(STMTS statement*) 
    ;

Parsing test input produced the following output:

Logging from @init
    getStart(): [@0,0:4='Print',<10>,1:0]
    getStop(): null
    getTree(): null
Logging from after last rule reference
    getStart(): [@0,0:4='Print',<10>,1:0]
    getStop(): null
    getTree(): null
Logging from @after
    getStart(): [@0,0:4='Print',<10>,1:0]
    getStop(): [@4,15:15='<EOF>',<-1>,1:15]
    getTree(): STMTS

The call in the after block has both the stop and tree fields populated.

I can't say whether this will help you with your merging tool, but I think this will at least get you past the problem with the half-populated scope object.

查看更多
登录 后发表回答