I'm reading about AST (abstract syntax trees) but all the samples I see use expressions such as:
a + b * c
Which could be represented in a lispy like syntax as:
(+ a (* b c) )
Which will be the equivalent to:
+
/ \
a *
/ \
b c
My question is How an AST for a class in a OOPL would look like?
My naive attempt is for this Java code:
class Person {
String name;
int age;
public String toString() {
return "name";
}
}
Is:
;Hand written
(classDeclaration Person
(varDeclaration String name)
(varDeclaration int age )
(funcDeclaration String toString
(return "name")
)
)
But I'm not quite sure how close or far am I to a real AST representation.
Does it depends on the language I choose. How much detail is needed? Are those "xyzDeclaraction" needed or could be as:
(Person (String name) (int age))
Where can I see a "real" representation of an actual programming language to learn more.
AST is an abstraction of the CST (concrete syntax tree, or, parse tree). The concrete syntax tree is the tree resulting from the productions (in the grammar) used to parse the file. So your AST is basically derived from your grammar definition, but has for transformed
All in all I think the example in your post looks fine. I would probably wrap the variable declarations in a
varDeclList
and the function declaration in amethDeclList
, and the return statement in astmtList
. (See below.)One more or less "real" representation of an AST is described by Apple in his book "Modern Compiler Implementation in Java". (Resources can be found here.)
Using those classes, your program would be represented as follows:
Take a look at the Eclipse JDT AST implementation.
As first introduction you could read this tutorial, too.
OP: Where can I see a real representation of an actual programming language to learn more?
For your source text as a file Person.java:
what follows are both Concrete and Abstract Syntax Tree in an S-expression-style dump of the parser tree from our DMS Software Reengineering Toolkit, using its Java1.6 parser. All the apparant complexity is pretty much caused by the real complexity of the language (e.g., of Java itself).
The CST clearly contains more stuff (139 nodes) than the AST (54 nodes). The AST drops everything that can be automatically inferred from the grammar, given the AST. This includes removing non-value-carrying leaves, unary productions, and compressing spines caused by left or right recursive grammar rules into explicit list nodes.
A left paren signals a new subtree. Following the left paren is the name of the node type; @Java~Java1_.6 might seem unnecessary until you understand DMS can handle many languages at once, including langauges nested inside one another. The #nnnnnn is the memory address of the node. ^M means "this node has M parents and is left off when M==1. Things inside [...] are the node value. A { M } means this list node has M list-children. Each node is stamped with position information.
This is the Concrete Syntax tree (see further down for AST):
This is the AST (automatically generated by DMS from the CST):
EDIT March 2015: Here's a link to some C++ AST examples
Edit May 2015: DMS has long done Java 1.7 and 1.8, too.