In code similar to each of the following examples, I would like to be able to statically analyze code to determine the list of possible values that are passed into SpecialFunction().
SpecialFunction(5); // A
int x = 5;
SpecialFunction(x); // B
int x = 5;
x = condition ? 3 : 19;
SpecialFunction(x); // C
I can already parse the C# into an abstract syntax tree and I can already handle cases like A, and I guess I could keep track of initial assignments of values to guess case B, but cases as easy as C seem to get complicated quickly.
I'm almost certain that we won't be able to solve for x in all cases statically and that's OK. I would like to know strategies for attempting it, ways to recognize when it can't be done. What if we need to include class-level fields and multi-threading? Closures? Would it help if we know that for the set X
of all possible values for x
, |X| < 50
?
From @Vladimir Perevalov's suggestion, how could the concepts in Pex be applied to finding possible values for targeted code points (rather than what Pex seems to do which is to discover code paths and values that lead to unchecked(?) exceptional cases)?
There is a project that does what you want (at least very near). It is Pex. Try looking at their docs, also you could decompile the sources and see what they do.
What you want is a both global data flow analysis ("what value assignments/side effects reach what usage points") [which requires control flow analysis as a precursor] and some kind of range analysis ("summarizing the set of values that can reach a point").
Computing data flow requires having a full C# front end, local control and data flow analysis, and then stitching those answers together into global data flow analysis.
Range analysis requires you first define how you intend to encode the set of possible values; what system of specifications is allowed? The simplest, just a set of values, tends to explode. An intermediate specification scheme would be something like the OP's single-relational-to-constant, e.g, "x < 50" . The trouble with any such limited scheme is that the richness of the set of values may cause you to get useless answers especially if there are other predicates of interest (if x is always odd, the single-relational-to-constant can only model this as "x < infinity" which is clearly not helpful. So, you want to choose a specification scheme which is complicated enough to model that kinds of values interest you. However, as your specification scheme gets more sophisticated, the machinery to infer those facts correctly get more complicated, so you can't make it too complicated.
Mostly the analysis tools available do not have such analyses, let alone exposed for you to you. PEX may indeed have such machinery; if you are lucky it is exposed, too.
Our DMS Software Reengineering Toolkit has generic parsing, symbol table building, control/data flow analysis and indeed even range analysis machinery (specification: x < k1*a+k2*b where k1 and k2 are constants, a and b are other program variables visibile where x is consumed). DMS has C#, Java, GNU C and COBOL front ends, and we have in fact instantiated this machinery for GNU C and IBM Enterprise COBOL (and partially for Java 7) by collecting (static analysis!) facts specific to those languages and feeding these facts to the generic machinery. We have not instantiated this machinery for C#, yet. But if you can't get a good answer from another source, this is likely pretty close.