I'm thinking about doing a static analysis project over C++ code samples, as opposed to entire programs. In general, static analysis requires some simpler intermediate representation, but such a representation cannot be accurately created without the entire program code.
Still, I know there is such a tool for Java - it basically "guesses" missing information and thus allows static analysis to take place even though it's no longer sound or complete.
Is there anything similar that can be used to convert partial C++ code into some intermediate form (e.g. LLVM bytecode)?
Understand 4 C++ by SciTools is a product that parses source code and provides metrics for various things. As a tool, the product is sort of like a source code browser, but I personally don't use it for that since Visual Studio's IntelliSense is just as good.
Its real power is that it comes with a C and Perl API, so you can write your own static analysis tools on top of it. And yes, it will deal quite well with missing code files. Also, Understand 4 C++ works on Windows and a bunch of other operating systems.
As to your last question about intermediate code, Understand 4 C++ doesn't provide you with an 'intermediate' form, but its API does provide an abstraction layer over the abstract syntax tree that gives you a lot of power to analyze source code. I have written a lot of tools at my work using this API, and a managed C++ API (which I wrote and shared publicly on CodePlex) that wraps its native C API.
As a general rule, if you guess, you guess wrong; any complaints from a static analyzer based on such guesses are false positives and will tend to cause a high rate of rejection.
If you insist on guessing, you'll need a tool that can parse arbitrary C++ fragments. ("Guess a static analysis of this method...."). Most C++ parsers will only parse complete source files, not fragments.
You'll also need a way to build up partial symbol tables. ("I is listed as an argument to FOO, but has no type information, and it is not the same I as is declared in the statement following the call to FOO").
Our DMS Software Reengineering Toolkit with its C++ Front End can provide parsing of fragments, and might be used as a springboard for partial symbol tables.
DMS provides general parsing/analysis/transformation on code, as determined by an explicit language definition provided to DMS. The C++ Front End provides a full, robust C++ front end enabling DMS to parse C++, build ASTs, and build up symbol tables for such ASTs using an Attribute Grammar (AG) in which the C++ lookup rules are encoded. The AG is a functional-style computation encoded over AST nodes; the C++ symbol table builder is in essence a big functional program whose parts are attached to BNF grammar rules for C++.
As part of the generic parsing machinery, given a language definition (such as the C++ front end), DMS can parse arbitrary (non)terminals of that language using its built-in pattern language. So DMS can parse expressions, methods, declarations, or any other well-formed code fragment and build ASTs. Where an ill-formed fragment is provided, one currently gets a syntax error on the fragment parse; it would be possible to extend DMS's error recovery to generate a plausible AST fix and thus parse arbitrary elements.
The partial symbol table is harder, since much of the symbol table building machinery depends on other parts of the symbol table being built. However, since this is all coded as an AG, one could run the part of the AG relevant to the fragment parsed, e.g., the symbol table building logic for a method. The AG would need to be modified, probably extensively, to allow it to operate with "assumptions" about missing symbol definitions; these would in effect become constraints. Of course, a missing symbol might be any of several things, and you might end up with several possible symbol table configurations. Consider:
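The statement at issue is `T*X;`. As a hedged sketch (the namespaces and the `interpret` functions are hypothetical, added only so the fragment compiles), here are two complete contexts in which that same statement means different things:

```cpp
#include <string>

// Context A: T names a type, so `T*X;` declares X as a pointer to T.
namespace t_is_type {
    struct T {};
    inline std::string interpret() {
        T*X;              // declaration: X has type T*
        X = nullptr;      // legal only because X is a pointer variable
        return X == nullptr ? "declaration" : "unexpected";
    }
}

// Context B: T and X name variables, so `T*X;` is a multiplication
// whose result is discarded (compilers typically warn about this).
namespace t_is_value {
    inline std::string interpret() {
        int T = 6, X = 7;
        T*X;              // expression statement: computes a product, discards it
        return (T * X == 42) ? "expression" : "unexpected";
    }
}
```

A fragment analyzer that sees only `T*X;` has to carry both readings until something tells it which kind of thing `T` is.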
Not knowing what T is, the type of the phrase (and even its syntactic category) can't be uniquely determined. (DMS will parse the T*X; and report an ambiguous parse, since there are multiple possible matching interpretations; see "Why can't C++ be parsed with an LR(1) parser?")
We've already done some work on this kind of partial parsing and partial symbol tables, in which we used DMS experimentally to capture code containing preprocessor conditionals, with some conditional status undefined. This causes us to build conditional symbol table entries. Consider:
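A fragment of the kind meant here, written as an assumed illustration consistent with the symbol table entry described next (the helper `use_X`, the counter `calls`, and the bodies are hypothetical), might look like this:

```cpp
// The status of `foo` is left to the build: compile with or without -Dfoo.
#ifdef foo
int X = 0;                      // under foo: X is an int
#else
int calls = 0;
void X(int n) { calls += n; }   // otherwise: X is a function of type void(int)
#endif

int use_X() {
#ifdef foo
    X++;                        // valid only when X is an int
    return X;
#else
    X(1);                       // valid only when X is a function
    return calls;
#endif
}
```

An ordinary compiler sees only one branch per build. A tool that keeps both branches has to accept both uses of `X` at once.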
With conditional symbols, this code can type check. The symbol table entry for X says something like, "X ==> int if foo else ==> void(int)".
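One way to picture such an entry is the sketch below. It is a hypothetical data structure for illustration only, not DMS's actual representation; `GuardedType`, `ConditionalSymbol`, `resolve`, and `entry_for_X` are all invented names.

```cpp
#include <map>
#include <string>
#include <vector>

// One guarded alternative: "under this condition, the symbol has this type".
struct GuardedType {
    std::string macro;    // preprocessor symbol the guard tests
    bool when_defined;    // true: guard is `macro`; false: guard is `!macro`
    std::string type;     // type of the symbol under that guard
};

// A symbol table entry whose type depends on unresolved conditionals.
struct ConditionalSymbol {
    std::vector<GuardedType> alternatives;

    // Pick the type once the status of the guarding macros is known.
    // Returns "<unresolved>" while a needed macro's status is unknown,
    // or when no guard matches.
    std::string resolve(const std::map<std::string, bool>& known) const {
        for (const GuardedType& alt : alternatives) {
            auto it = known.find(alt.macro);
            if (it == known.end()) return "<unresolved>";
            if (it->second == alt.when_defined) return alt.type;
        }
        return "<unresolved>";
    }
};

// The entry described above: X ==> int if foo else ==> void(int).
inline ConditionalSymbol entry_for_X() {
    ConditionalSymbol x;
    x.alternatives.push_back(GuardedType{"foo", true, "int"});
    x.alternatives.push_back(GuardedType{"foo", false, "void(int)"});
    return x;
}
```

Type checking a use of `X` under such an entry means checking it against each alternative under the same guard, rather than against a single type.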
I think the idea of reasoning about large program fragments with constraints is great, but I suspect it is really hard, and you'll forever be trying to resolve enough information about a constraint in order to do static analysis.
You can check this:
What open source C++ static analysis tools are available?
It covers much the same question, and some solutions are provided there. Those may be helpful!
Don't know about LLVM bytecode, but there is an old tool called PC-lint:
http://www.gimpel.com/html/index.htm
They even have an online testing module, where you can post portions of code.