-->

How to parse C++ source in Python?

2020-07-04 06:58发布

问题:

We want to parse our huge C++ source tree to gain enough info to feed to another tool to make diagrams of class and object relations, discern the overall organization of things etc.

My best try so far is a Python script that scans all .cpp and .h files, runs regex searches to try to detect class declarations, methods, etc. We don't need a full-blown analyzer to capture every detail, or some heavy UML diagram generator - there's a lot of detail we'd like to ignore and we're inventing new types of diagrams. The script sorta works, but by gosh it's true: C++ is hard to parse!

So I wonder what tools exist for extracting the info we want from our sources? I'm not a language expert, and don't want something with a steep learning curve. Something we low-brow blue-collar programmer grunts can use :P

Python is preferred as one of the standard languages here, but it's not essential.

回答1:

I'll simply recommend Clang.

It's a C++ library-based compiler designed with ease of reuse in mind. It notably means that you can use it solely for parsing and generating an Abstract Syntax Tree. It takes care of all the tedious operator overloading resolution, template instantiation and so on.

Clang exports a C-based interface, which is extended with Python Bindings. The interface is normally quite rich, but I haven't use it. Anyway, contributions are welcome if you wish to help extending it.



回答2:

You could check out GccXML and OpenC++, as well as doxygen.



回答3:

Can you run a preprocessing step? Doxygen parses most C++ syntax and creates xml with all the relationships. Compilers also create debug databases (typically dwarf format from gcc and codeview format from MSC).



回答4:

From what you say of our requirements, Tony's answer of GccXML will probably be the best option. If that doesn't work, you could try to generate an outline of your program with cscope or ctags, and then work your way to the info you want from it's output.



回答5:

You asked for tools that can extract information from C++.

Our DMS Software Reengineering Toolkit is configurable compiler technology for building custom analyzers. It has a full C++ Front End with a preprocesser, full C++ parsing with AST construction (including capture of comments), and full symbol table. These could be used to extract such structural information, and export it to whatever you want to process it.

EDIT: One of the comments is that there are only 3 full C++ parsers in the world. I suspect more; surely IBM has one that works. DMS's C++ front end has been used in anger on large applications in both MS Visual Studio and on GNU C++ source codes, so it might reasonably qualify, too :-}



回答6:

I've had good experience with PLY:

http://www.dabeaz.com/ply/

But this requires some experience with lex and yacc



回答7:

If you can bring yourself to run this analysis using a Windows-platform application, save yourself a lot of time and trouble, and spend $200 on Enterprise Architect by Sparx Systems (I have no affiliation with this company, just a satisfied customer). (Note: this should not be confused with Microsoft's own "Enterprise Architect" bundle for Visual Studio.)

EA can reverse-engineer a number of languages, including C++, C, Java, and Python, generating some very nice UML class diagrams. (EA comes in a number of different packages, Desktop is the cheapest but you have to by Professional, the 2nd cheapest, to get the code engineering feature included.) I also like the integration between the generated class diagrams and sequence diagramming, where you can drag a line between object lifelines and a menu of defined methods is presented to you based on the class definition of the target object. At my former consulting business, we used this tool quite a bit to develop system architectural proposals which we then included as part of our project bid (just copy/paste the diagram into a Word doc). It wont take long to make back your $200.