I am creating a compiler in OCaml where the grammar is as follow:
type expr =
| Cons of const
| Var of string
| List of ( expr list )
| Sum of ( expr * expr )
| Less_than of ( expr * expr )
| Conditional of ( expr * expr * expr )
| Array_literal of ( expr )
| Array_read of ( expr * expr )
The node of an AST looks like this:
type 'a astNode =
{
data: 'a;
metadata: Metadata;
}
and the Metadata module looks like this:
module Metadata = struct
type loc = Lexing.position
type loc_range = loc * loc
and metadata = ?
end
What should the grammar for metadata be? What would my code look like after the line and metadata = ?
Basically, when would I be required to update the AST with the metadata information. How should I structure my AST to contain the metadata information?
Metadata for me currently means it's position such as line number, file name etc. This is encompassed in the Lexing.position module.
This is more a software design question and there are a few common solutions. I will try to cover them all.
The most common solution is to wrap your expression type into a record that has, besides the expression payload, some metadata. The type of metadata could be made abstract or concrete, it is also a matter of taste. Abstract type for metadata makes it easier to extend it later. The classical approach is implemented in OCaml compiler itself, refer to the Locations module, it will tell you how to get location information in classical OCaml Yacc/Menhir parsers.
A slightly different approach is to index the tree and attach an identifier to each AST node. Then you can use external maps (like any associative container) to add arbitrary metadata to your AST. Such approach is used in BAP Primus Lisp Parser. The nice thing is that this approach is that it makes it easy to combine it with hash-consing. It also makes the set of attributes that you can associate with your nodes easily extensible, with the cost the mapping is now partial. (Or common choice between a record a mapping).
Instead of storing indexes directly in the nodes, you may pick some particular method for numbering your nodes (e.g., DFS numbering). With this approach you do not need to modify your AST type, that is especially nice if you are not controlling it. An example of such approach is the Positions module from the Janestreet's
parsexp
library that implements the compact set of positions based on some traversal (not the DFS numbering). Unfortunately, their implementation is not general enough to be reused with different AST, but the approach is general.