I'm looking for a way to parse C++ code to retrieve some basic information about classes. I don't actually need much information from the code itself, but I do need it to handle things like macros and templates. In short, I want to extract the "structure" of the code: what you would show in a UML diagram.
For each class/struct/union/enum/typedef in the code base, all I need (after templates & macros have been handled) is:
- Their name
- The namespace in which they live
- The fields contained within (name of type, name of field and access restrictions, such as private/mutable/etc)
- Functions contained within (return type, name, parameters)
- The declaring file
- Line/column numbers (or byte offset in file) where the definition of this data begins
The actual instructions in the code are irrelevant for my purposes.
I'm anticipating a lot of people saying I should just use a regex for this (or even Flex & Bison), but these aren't really viable options, as I do need the preprocessor and template stuff handled properly.
Sounds like a job for GCC-XML in combination with the C++ XML library or XML-friendly scripting language of your choice.
- Elsa: The Elkhound-based C/C++ Parser,
- clang: a C language family frontend for LLVM/Clang Static Analyzer,
- ANTLR Parser Generator Grammar List (search for C++, there is more than one grammar),
- OpenC++ (adds reflection capabilities to C++),
- Stratego XT (full programs transformation - see CodeBoost, which for parsing uses OpenC++ just mentioned, for an example application to C++ programs),
- Parsing C++ at nobugs.org (not a parser but interesting bits of information; in particular Edward D. Willink's "Meta-Compilation for C++" PhD thesis and Mike Dimmick's overview of his attempt to parse C++).
See also Ira Baxter here, where he cites his own product.
Warning: of the tools listed above, only Elsa "..I hear does a fairly good job.." at constructing a symbol table, which according to Ira Baxter is necessary for the OP's original intent (see the comments on this answer; I quote him because he is an expert in the field).
Running Doxygen on the code would give you most of that, wouldn't it?
In what format do you want the output?
Exuberant Ctags will give you most of what you need; it's usually used by editors to provide code navigation.
It may choke on some templates, though.
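To give an idea of what you would get back, here is a minimal Python sketch that reads entries in the extended tags file format (tag name, file, search address, then tab-separated extension fields). The two sample lines are hand-written for illustration rather than real ctags output, the `line:` field assumes ctags was run with something like `--fields=+n`, and the helper names are hypothetical. It also assumes the search pattern contains neither a tab nor the `;"` separator, which a robust reader would have to handle.

```python
# Hand-written sample in the extended tags format (not real ctags output).
SAMPLE_TAGS = "\n".join([
    '!_TAG_FILE_FORMAT\t2\t/extended format/',
    'Widget\tsrc/widget.h\t/^class Widget {$/;"\tc\tline:4\tnamespace:ui',
    'size\tsrc/widget.h\t/^    int size;$/;"\tm\tline:6\tclass:ui::Widget\taccess:private',
])

def parse_tag_line(line):
    """Split one tags line into a dict of its name, file, address and fields."""
    name, path, rest = line.rstrip("\n").split("\t", 2)
    # Everything after ;" is the list of extension fields.
    address, _, extension = rest.partition(';"')
    entry = {"name": name, "file": path, "address": address}
    for item in extension.split("\t"):
        if not item:
            continue
        key, sep, value = item.partition(":")
        if sep:
            entry[key] = value      # key:value field, e.g. line:4
        else:
            entry["kind"] = item    # bare kind letter, e.g. c for class
    return entry

def parse_tags(text):
    """Parse every non-header line of a tags file."""
    return [parse_tag_line(l) for l in text.splitlines()
            if l and not l.startswith("!_TAG_")]

for tag in parse_tags(SAMPLE_TAGS):
    print(tag["name"], tag["kind"], tag["file"], tag.get("line"))
```

Note that this only reads what ctags already found; it does not make ctags any better at templates or macros.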
The DMS Software Reengineering Toolkit is general-purpose program analysis and transformation machinery. Its C++ Front End builds on DMS to provide full-featured C++ parsing for a variety of common C++ dialects, can process a set of C++ classes simultaneously, and constructs full name/type/access information that you can use any way you want. Information is tagged with its precise origin (file/line/column). It includes a full preprocessor.
You are right; regex can't even come close to this.
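A small illustration of why, as a Python sketch: the C++ snippet and the pattern below are made up for the example. The class that the macro actually declares (`Widget`) never appears next to the `class` keyword in the source text, so a textual match reports the macro parameter instead.

```python
import re

# Hypothetical C++ input: one class is declared through a macro.
SOURCE = """
#define DECLARE_HANDLE(name) class name { int id; };
DECLARE_HANDLE(Widget)
template <typename T> class Box { T value; };
"""

# A naive pattern for "class <identifier>".
CLASS_RE = re.compile(r"\bclass\s+(\w+)")

def naive_class_names(text):
    """Return every identifier that directly follows the keyword 'class'."""
    return CLASS_RE.findall(text)

# Reports the macro parameter 'name' and misses 'Widget' entirely.
print(naive_class_names(SOURCE))
```

The template case is no better: the pattern sees `Box` but knows nothing about which instantiations exist or what `T` resolves to.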
You can easily get macros expanded by just running the preprocessor (cpp) on the source.
Templates are not that easy, since template instantiation happens much later.
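As a rough sketch of the macro-expansion step, the following Python snippet pipes a made-up C++ fragment through the preprocessing stage only. It assumes a GCC-compatible `g++` is on the PATH (`-E` stops after preprocessing, `-P` suppresses line markers, `-x c++ -` reads C++ from standard input); the function name and the sample source are illustrative.

```python
import shutil
import subprocess

# Hypothetical input: a class declared through a macro.
SOURCE = """\
#define DECLARE_HANDLE(name) class name { int id; };
DECLARE_HANDLE(Widget)
"""

def preprocess(source, compiler="g++"):
    """Run only the preprocessing stage and return the expanded text.

    Returns None if the compiler is not installed.
    """
    if shutil.which(compiler) is None:
        return None
    result = subprocess.run(
        [compiler, "-E", "-P", "-x", "c++", "-"],
        input=source, capture_output=True, text=True, check=True,
    )
    return result.stdout

expanded = preprocess(SOURCE)
print(expanded if expanded is not None else "no compiler found")
```

Dropping `-P` keeps the line markers, which you would need in order to map expanded text back to the original file and line. Any templates in the input come out of this step unchanged, so it only solves the macro half of the problem.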
Doxygen can also produce detailed XML output if you set an option (GENERATE_XML) in the configuration file. It is quite thorough, and very easy to use. From the doxygen home page:
The XML output consists of a structured "dump" of the information gathered by doxygen. Each compound (class/namespace/file/...) has its own XML file and there is also an index file called index.xml.
A file called combine.xslt XSLT script is also generated and can be used to combine all XML files into a single file.
Doxygen also generates two XML schema files index.xsd (for the index file) and compound.xsd (for the compound files). This schema file describes the possible elements, their attributes and how they are structured, i.e. it describes the grammar of the XML files and can be used for validation or to steer XSLT scripts.
In the addon/doxmlparser directory you can find a parser library for reading the XML output produced by doxygen in an incremental way (see addon/doxmlparser/include/doxmlintf.h for the interface of the library)
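If you would rather not use that library, the compound files are plain XML and can be read with any XML parser. Below is a minimal Python sketch of pulling out the name, location and members of each compound. The embedded XML is a hand-written, heavily trimmed stand-in for a real compound file (real output has many more elements and attributes, and the ids here are invented), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for a doxygen compound XML file.
SAMPLE_XML = """
<doxygen>
  <compounddef id="classui_1_1Widget" kind="class" prot="public">
    <compoundname>ui::Widget</compoundname>
    <sectiondef kind="private-attrib">
      <memberdef kind="variable" id="w_1a1" prot="private" static="no" mutable="no">
        <type>int</type>
        <name>size</name>
        <location file="src/widget.h" line="6" column="9"/>
      </memberdef>
    </sectiondef>
    <sectiondef kind="public-func">
      <memberdef kind="function" id="w_1a2" prot="public" static="no">
        <type>void</type>
        <name>resize</name>
        <param><type>int</type><declname>width</declname></param>
        <location file="src/widget.h" line="8" column="10"/>
      </memberdef>
    </sectiondef>
    <location file="src/widget.h" line="4" column="1"/>
  </compounddef>
</doxygen>
"""

def text_of(element):
    """Flatten an element's text, including any nested reference elements."""
    return "".join(element.itertext()).strip() if element is not None else ""

def summarize(xml_text):
    """Return one dict per compound with its location and members."""
    compounds = []
    for comp in ET.fromstring(xml_text).iter("compounddef"):
        loc = comp.find("location")  # direct child: the compound's own location
        entry = {
            "name": text_of(comp.find("compoundname")),
            "kind": comp.get("kind"),
            "file": loc.get("file") if loc is not None else None,
            "line": int(loc.get("line", 0)) if loc is not None else None,
            "members": [],
        }
        for member in comp.iter("memberdef"):
            entry["members"].append({
                "kind": member.get("kind"),
                "access": member.get("prot"),
                "type": text_of(member.find("type")),
                "name": text_of(member.find("name")),
                "params": [text_of(p.find("type")) for p in member.findall("param")],
            })
        compounds.append(entry)
    return compounds

for compound in summarize(SAMPLE_XML):
    print(compound["kind"], compound["name"], compound["file"], compound["line"])
    for m in compound["members"]:
        print("  ", m["access"], m["kind"], m["type"], m["name"], m["params"])
```

The qualified name in `compoundname` carries the namespace, so most of the items in the question's list map onto one element or attribute each.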