General strategy for designing Flexible Language ap

Posted 2019-07-24 14:36

Question:

Requirement:

I am trying to develop a language application using antlr4. The language in question is not important. What matters is that the grammar is very large (easily more than 2000 rules). I want to perform a number of operations:

  • Extract various kinds of information, such as call graphs, variable names, and constant expressions.
  • Any number of transformations:
    • If a loop can be expanded, we go ahead and expand it.
    • If we can eliminate dead code, we might choose to do that.
    • We might choose to rename all variables to conform to some norms.

Each of these operations can be applied independently of the others. After applying these steps, I want to write the result out in a form as close as possible to the original input.

For example, we might want to eliminate loops and rename the variables, and then output the result in the original language format.

Questions:

  1. I see a need to build a custom tree (read: AST) for this, so that I can modify the tree with each of the transformations. However, when I want to generate the output, I lose the nice abilities of the TokenStreamRewriter. I have to specify how to write out each node of the tree, and I lose the original input formatting in the places where I didn't make any transformations. Does antlr4 provide a good way to get around this problem?
  2. Is an AST the best way to go? Or should I build my own object representation? If so, how do I create that object representation efficiently? Creating one is a very big pain for such a large language, but it may be better in the long run. Again, how do I get back the original formatting?
  3. Is it possible to work on the parse tree alone?
  4. Are there similar language applications that do the same thing? If so, what strategy do they use?

Any input is welcome. Thanks in advance.

Answer 1:

In general, what you want is called a Program Transformation System (PTS).

PTSs generally have parsers, build ASTs, and can prettyprint the ASTs to recover compilable source text. More importantly, they have standard ways to navigate, inspect, and modify the ASTs so that you can change them programmatically.
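
To make the navigate/modify/prettyprint cycle concrete, here is a minimal sketch in plain Java. It is not the API of any real PTS or of antlr4: the node classes, `rewrite`, `foldConstants`, and `rename` are all hypothetical names invented for illustration, the tree is built by hand instead of by a parser, and the prettyprinter regenerates text from scratch (so it does not preserve original formatting).

```java
import java.util.function.UnaryOperator;

/** Hypothetical miniature of the AST side of a PTS; no real tool's API is shown. */
final class AstSketch {
    /** A node can print itself and rebuild itself from transformed children. */
    interface Node {
        String print();
        Node mapChildren(UnaryOperator<Node> f);
    }

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public String print() { return Integer.toString(value); }
        public Node mapChildren(UnaryOperator<Node> f) { return this; }
    }

    static final class Var implements Node {
        final String name;
        Var(String name) { this.name = name; }
        public String print() { return name; }
        public Node mapChildren(UnaryOperator<Node> f) { return this; }
    }

    static final class BinOp implements Node {
        final char op;
        final Node left, right;
        BinOp(char op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
        public String print() { return "(" + left.print() + " " + op + " " + right.print() + ")"; }
        public Node mapChildren(UnaryOperator<Node> f) {
            return new BinOp(op, f.apply(left), f.apply(right));
        }
    }

    /** Bottom-up traversal: transform the children first, then apply the rule to the rebuilt node. */
    static Node rewrite(Node n, UnaryOperator<Node> rule) {
        return rule.apply(n.mapChildren(c -> rewrite(c, rule)));
    }

    /** Transformation 1: fold '+' and '*' when both operands are literals. */
    static Node foldConstants(Node n) {
        if (n instanceof BinOp) {
            BinOp b = (BinOp) n;
            if (b.left instanceof Num && b.right instanceof Num) {
                int l = ((Num) b.left).value, r = ((Num) b.right).value;
                if (b.op == '+') return new Num(l + r);
                if (b.op == '*') return new Num(l * r);
            }
        }
        return n;
    }

    /** Transformation 2: rename one variable (ignores scoping, which a real tool must not). */
    static UnaryOperator<Node> rename(String from, String to) {
        return n -> (n instanceof Var && ((Var) n).name.equals(from)) ? new Var(to) : n;
    }

    public static void main(String[] args) {
        // Hand-built tree for: x + (2 * 3)
        Node tree = new BinOp('+', new Var("x"), new BinOp('*', new Num(2), new Num(3)));
        Node folded = rewrite(tree, AstSketch::foldConstants);
        Node renamed = rewrite(folded, rename("x", "count"));
        System.out.println(tree.print());    // (x + (2 * 3))
        System.out.println(renamed.print()); // (count + 6)
    }
}
```

Because each transformation is just a function from node to node, the two passes can be applied independently or chained in any order, which is the property the question asks for.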

Many offer these capabilities in the form of pattern-matching code fragments written in the surface syntax of the language being transformed; this avoids forever having to know excruciatingly fine details about which nodes are in your AST and how they are related to their children. This is incredibly useful when you have big, complex grammars, as most of our modern (and our legacy) languages seem to have.
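
The matching machinery behind such patterns can be sketched as follows. This is an illustration under stated assumptions, not how any particular PTS implements it: `Tree`, `match`, `instantiate`, and `rewriteAll` are hypothetical names, labels starting with `?` are treated as pattern variables, and the pattern is built by hand as a tree. A real PTS would instead parse a pattern such as `?x + 0` written in the surface syntax, which is exactly what spares you from knowing the node names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of rewrite rules with pattern variables over a generic tree. */
final class PatternSketch {
    /** A label plus children; labels starting with '?' act as pattern variables. */
    static final class Tree {
        final String label;
        final List<Tree> kids;
        Tree(String label, Tree... kids) { this.label = label; this.kids = Arrays.asList(kids); }
        boolean isVar() { return label.startsWith("?"); }
        @Override public String toString() {
            if (kids.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(" + label);
            for (Tree k : kids) sb.append(' ').append(k);
            return sb.append(')').toString();
        }
    }

    /** Matches pattern against subject, recording what each variable is bound to. */
    static boolean match(Tree pattern, Tree subject, Map<String, Tree> env) {
        if (pattern.isVar()) {
            Tree bound = env.putIfAbsent(pattern.label, subject);
            // A repeated variable must bind to the same subtree (compared by printed form here).
            return bound == null || bound.toString().equals(subject.toString());
        }
        if (!pattern.label.equals(subject.label) || pattern.kids.size() != subject.kids.size()) {
            return false;
        }
        for (int i = 0; i < pattern.kids.size(); i++) {
            if (!match(pattern.kids.get(i), subject.kids.get(i), env)) return false;
        }
        return true;
    }

    /** Builds the replacement; assumes every variable in the template was bound by the match. */
    static Tree instantiate(Tree template, Map<String, Tree> env) {
        if (template.isVar()) return env.get(template.label);
        Tree[] kids = new Tree[template.kids.size()];
        for (int i = 0; i < kids.length; i++) kids[i] = instantiate(template.kids.get(i), env);
        return new Tree(template.label, kids);
    }

    /** Applies the rule lhs -> rhs bottom-up at every node where lhs matches. */
    static Tree rewriteAll(Tree t, Tree lhs, Tree rhs) {
        Tree[] kids = new Tree[t.kids.size()];
        for (int i = 0; i < kids.length; i++) kids[i] = rewriteAll(t.kids.get(i), lhs, rhs);
        Tree rebuilt = new Tree(t.label, kids);
        Map<String, Tree> env = new HashMap<>();
        return match(lhs, rebuilt, env) ? instantiate(rhs, env) : rebuilt;
    }

    public static void main(String[] args) {
        // Rule "?x + 0 -> ?x", written here in prefix tree form.
        Tree lhs = new Tree("+", new Tree("?x"), new Tree("0"));
        Tree rhs = new Tree("?x");
        // Subject: (a + 0) * b
        Tree subject = new Tree("*", new Tree("+", new Tree("a"), new Tree("0")), new Tree("b"));
        System.out.println(rewriteAll(subject, lhs, rhs)); // (* a b)
    }
}
```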

More sophisticated PTSs (very few) provide additional facilities for teasing out the semantics of the source code. It is pretty hard to analyze or transform most code without knowing which scopes individual symbols belong to, what their types are, and many other details such as data flow. Full disclosure: I build one of these.
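
As a small illustration of why scope information matters, consider a rename: changing a global `x` must not touch a local `x` that shadows it. The sketch below shows a nested-scope symbol table in plain Java. All names (`Scope`, `Symbol`, `declare`, `resolve`) are hypothetical and chosen for this example; real tools track far more, such as types and data flow.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of nested scopes with innermost-first name resolution. */
final class ScopeSketch {
    static final class Symbol {
        final String name, type, scope;
        Symbol(String name, String type, String scope) {
            this.name = name; this.type = type; this.scope = scope;
        }
    }

    static final class Scope {
        final String name;
        final Scope parent; // null for the outermost scope
        final Map<String, Symbol> symbols = new HashMap<>();
        Scope(String name, Scope parent) { this.name = name; this.parent = parent; }

        /** Declares a symbol in this scope, shadowing any outer declaration of the same name. */
        Symbol declare(String id, String type) {
            Symbol s = new Symbol(id, type, name);
            symbols.put(id, s);
            return s;
        }

        /** Looks the name up in this scope first, then walks outward through the parents. */
        Optional<Symbol> resolve(String id) {
            for (Scope s = this; s != null; s = s.parent) {
                Symbol sym = s.symbols.get(id);
                if (sym != null) return Optional.of(sym);
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Scope global = new Scope("global", null);
        global.declare("x", "int");
        global.declare("limit", "int");

        Scope f = new Scope("f", global);
        f.declare("x", "string"); // shadows the global x

        // Inside f, 'x' is the local string and 'limit' is the global int.
        System.out.println(f.resolve("x").get().scope);     // f
        System.out.println(f.resolve("limit").get().scope); // global
        System.out.println(global.resolve("x").get().type); // int
        System.out.println(f.resolve("y").isPresent());     // false
    }
}
```

A scope-aware rename would resolve every use of `x` first and rewrite only those uses whose resolved symbol is the one being renamed.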



Tags: antlr dsl antlr4