Using clang to analyze C++ code

2019-04-09 15:25发布

问题:

We want to do some fairly simple analysis of user's C++ code and then use that information to instrument their code (basically regen their code with a bit of instrumentation code) so that the user can run a dynamic analysis of their code and get stats on things like ranges of values of certain numeric types.

clang should be able to handle enough C++ now to handle the kind of code our users would be throwing at it - and since clang's C++ coverage is continuously improving by the time we're done it'll be even better.

So how does one go about using clang like this as a standalone parser? We're thinking we could just generate an AST and then walk it looking for objects of the classes we're interested in tracking. Would be interested in hearing from others who are using clang without LLVM.

回答1:

clang is designed to be modular. Quoting from its page:

A major design concept for clang is its use of a library-based architecture. In this design, various parts of the front-end can be cleanly divided into separate libraries which can then be mixed up for different needs and uses.

Look at clang libraries like libast for your needs. Read more here.



回答2:

What you didn't indicate is what kind of "analyses" you wanted to do. Most C++ analyses require that you have accurate symbol table data so that when you encounter a symbol foo you have some idea what it is. (You technically don't even know what + is without such a symbol table!) You also need generic type information; if you have an expression "a*b", what is the type of the result? Having "name and type" information is key to almost anything you want to do for analysis.

If you insist on clang, then there are other answers here. I don't know it it provides for name and type resolution.

If you need name and type resolution, then another solution would the DMS Software Reengineering Toolkit. DMS provides generic compiler like infrastructure for parsing, analyzing, transforming, and un-parsing (regenerating source code from the compiler data structures). DMS's industrial-strength C++ front end (it has many other language front ends, too) provides full name and type resolution according to the ANSI standard as well a GCC and MS VC++ dialects.

Code transformations can be implemented via an abstract-syntax tree interface provided by DMS, or by pattern-directed program transformation rules written in the surface syntax of your target language (in this case, C++). Here's a simple transformation using the rule language:

    domain Cpp~GCC3;  -- says we want patterns for C++ in the GCC3 dialect

    rule optimize_to_increment(lhs:left_hand_side):expression -> expression
      " \lhs = \lhs + 1 " ->   " \lhs++"  if no_side_effects(lhs).

This implicitly operates on the ASTs built by DMS, to modify them. The conditional allows you to inquire about arbitrary properties of pattern variables (in this case, lhs), including name and type constraints if you wish.

DMS has been used many times for very sophisticated program analysis and transformation of C++ code. We build C++ test coverage tools by instrumenting C++ code in a rather obvious way using DMS. At the website, there's a bibligraphy with papers describing how DMS was used to restructure the architecture of a large product line of military aircraft mission software. This kind of activity literally pours C++ in one architectural shape into another by applying large numbers of pattern directed transforms such as the above.

It is likely to be very easy to implement your instrumentation. And you don't have to wait for it to mature.



标签: c++ clang