How do tools like Red Gate's ANTS Profiler or Reflector convert IL into C# or VB.NET code?
I recently noticed that ANTS Profiler does not generate the same source code that was written originally. It generated a while loop where I had used a foreach.
That made me think, so I opened Reflector.exe itself in Reflector, but its code is mostly (though not all of it) obfuscated.
Decompilers in general work by looking at the IL and constructing source code that is semantically equivalent to it. This can't always yield the original source code, because IL drops some high-level information (though not as much as machine code does) and because there are usually several pieces of source that compile to the same IL. For example, a foreach loop is equivalent to a certain kind of while loop: one that sets up an enumerator first, loops until the enumerator is exhausted, and advances the enumerator at each step.
A common technique for implementing decompilation is to use something called "interval analysis" to identify the extent of loops.
When combined with idiom recognition and a pattern known as the "derived sequence of graphs", one can start with a control-flow graph containing assembly language (or MSIL) and iteratively simplify it until a single AST (abstract syntax tree) node represents a "source level" view of the program (or method). Given the AST, generating source is trivial: you just pretty-print the resulting tree.
Here are some links to more information:
http://en.wikipedia.org/wiki/Interval_(graph_theory)
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.8004&rep=rep1&type=pdf
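To make the interval/derived-graph idea concrete, here is a toy, hypothetical sketch in Python (not how Reflector or ANTS actually implements it; the dict-based CFG encoding and node names are invented). It partitions a CFG into Allen/Cocke-style intervals, then collapses each interval into a single node to form the next graph in the derived sequence. On a reducible graph, repeating this shrinks the CFG to one node, mirroring how a decompiler folds regions into higher-level AST nodes.

```python
def predecessors(cfg):
    """cfg is a dict {node: iterable of successor nodes}."""
    preds = {n: set() for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    return preds

def intervals(cfg, entry):
    """Partition cfg into intervals: each starts at a header and greedily
    absorbs nodes all of whose predecessors are already in the interval."""
    preds = predecessors(cfg)
    worklist, placed, result = [entry], set(), []
    while worklist:
        h = worklist.pop(0)
        if h in placed:
            continue
        interval = {h}
        changed = True
        while changed:
            changed = False
            for n in cfg:
                if (n not in interval and n not in placed
                        and preds[n] and preds[n] <= interval):
                    interval.add(n)
                    changed = True
        placed |= interval
        # nodes reachable from this interval become headers of new intervals
        for n in cfg:
            if n not in placed and preds[n] & interval:
                worklist.append(n)
        result.append(frozenset(interval))
    return result

def derived_graph(cfg, entry):
    """Collapse each interval to one node: the next graph in the sequence."""
    ivs = intervals(cfg, entry)
    owner = {n: iv for iv in ivs for n in iv}
    g = {iv: set() for iv in ivs}
    for n, succs in cfg.items():
        for s in succs:
            if owner[n] != owner[s]:
                g[owner[n]].add(owner[s])
    return g, owner[entry]

# CFG of a simple while loop: entry -> cond; cond -> body | exit; body -> cond
cfg = {"entry": ["cond"], "cond": ["body", "exit"], "body": ["cond"], "exit": []}
g1, e1 = derived_graph(cfg, "entry")  # the loop collapses: 2 nodes remain
g2, e2 = derived_graph(g1, e1)        # fully collapsed: 1 node
```

The loop (cond/body/exit) falls into one interval, so the first derivation leaves two nodes, and the second leaves one; a decompiler would attach "this interval is a while loop" to the collapsed node at each step.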
Generally, you'll never have complete fidelity between the original source and the decompiled source.
The control flow graph for a foreach loop looks like the control flow graph for a while loop. This is largely because the following code:
foreach (var x in xs) {
    // body
}

is actually syntactic sugar for:

var enumerator = xs.GetEnumerator();
try {
    while (enumerator.MoveNext()) {
        var x = enumerator.Current;
        // body
    }
}
finally {
    enumerator.Dispose();
}
That is, the foreach loop is basically translated into the while loop, and then the while loop is compiled into MSIL.
In order for a decompiler to produce a foreach loop, it would have to include special support for guessing when a while loop is in fact a foreach loop. Any such logic would be imperfect, since the while loop above and the original foreach loop yield the same (or very similar) MSIL.
In some cases it would match the source you wrote, and in other cases it wouldn't.
Chances are you would have written a foreach loop, so from a usability perspective, erring on the side of foreach loops over while loops is a good choice.
However, it's extra work: the decompiler writer has to set out to add a bunch of heuristics for detecting foreach loops.
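To show what such a heuristic might look like, here is a minimal, hypothetical sketch in Python over a toy statement list (tuples standing in for a decompiler's IR; the node shapes, and names like "assign" and "while", are invented for illustration). It pattern-matches the GetEnumerator/MoveNext/Current idiom and rewrites it into a single foreach node, leaving everything else untouched:

```python
def recognize_foreach(stmts):
    """Rewrite ("assign", e, ("call", xs, "GetEnumerator")) followed by
    ("while", ("call", e, "MoveNext"), [assign x = e.Current, *body])
    into a single ("foreach", x, xs, body) node."""
    out, i = [], 0
    while i < len(stmts):
        s = stmts[i]
        if (i + 1 < len(stmts)
                and s[0] == "assign"
                and isinstance(s[2], tuple)
                and s[2][0] == "call" and s[2][2] == "GetEnumerator"
                and stmts[i + 1][0] == "while"):
            e = s[1]
            cond, body = stmts[i + 1][1], stmts[i + 1][2]
            if (cond == ("call", e, "MoveNext")
                    and body
                    and body[0][0] == "assign"
                    and body[0][2] == ("call", e, "Current")):
                # collapse the three-part idiom into one foreach node
                out.append(("foreach", body[0][1], s[2][1], body[1:]))
                i += 2
                continue
        out.append(s)
        i += 1
    return out

stmts = [
    ("assign", "en", ("call", "xs", "GetEnumerator")),
    ("while", ("call", "en", "MoveNext"),
     [("assign", "x", ("call", "en", "Current")),
      ("print", "x")]),
]
rewritten = recognize_foreach(stmts)
# rewritten == [("foreach", "x", "xs", [("print", "x")])]
```

Note how brittle the match is: if the compiler had renamed the temporary, reordered the body, or wrapped the loop in the try/finally shown above, this exact pattern would no longer fire, which is why real decompilers need many variants of such checks.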
Finally, there are a lot of things that can frustrate decompilation. For example, the presence of "break" and "continue" statements can really complicate things, as can certain kinds of nesting (loops inside switch statements, and vice versa). These can all lead to CFG loops with more than one entry point or more than one exit point, which increases the difficulty of generating readable source code.
Usually the only way to deal with those cases is to use heuristics. Those heuristics will sometimes "do the wrong thing".
Also, even little things, like the particular expressions used for loop bounds, can break idiom recognition. Sometimes the compiler introduces temporary variables that were not present in the source. This may require either extra idiom checks or more advanced techniques such as data-flow analysis (live-variable analysis, definition-use chains, etc.).
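As an illustration of the temporary-variable problem, here is a toy, hypothetical pass in Python (the three-tuple instruction format is invented) that folds away a compiler-introduced temporary when definition-use counting shows it is defined once and used once as a plain copy, restoring the expression that idiom recognition expects. A real decompiler would do this with proper data-flow analysis and iterate to a fixed point:

```python
from collections import Counter

def fold_temps(code):
    """code: list of ("assign", dst, src) where src is a string expression.
    Fold away vars that are defined once and used exactly once as an
    entire src (i.e. plain copies), as a compiler temporary typically is.
    Single level only; a real pass would iterate."""
    def_idx, def_n = {}, Counter()
    for i, (_, dst, _) in enumerate(code):
        def_n[dst] += 1
        def_idx[dst] = i
    use_n = Counter(src for _, _, src in code)
    drop, repl = set(), {}
    for _, _, src in code:
        if src in def_idx and def_n[src] == 1 and use_n[src] == 1:
            repl[src] = code[def_idx[src]][2]  # inline the temp's value
            drop.add(def_idx[src])             # and delete its definition
    return [(op, dst, repl.get(src, src))
            for i, (op, dst, src) in enumerate(code) if i not in drop]

code = [
    ("assign", "t0", "n + 1"),  # compiler-introduced temporary
    ("assign", "x", "t0"),      # copy the original source never contained
]
folded = fold_temps(code)
# folded == [("assign", "x", "n + 1")]
```

After folding, a downstream idiom matcher sees "x = n + 1" directly, instead of a two-instruction sequence that matches none of its patterns.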
In summary: decompilation is hard, and will never be perfect. Also, I'm sure the implementers had to consider tradeoffs. For example, does it make sense to invest in detecting for-each loops or should time be spent in decompiling lambdas and inferring LINQ queries?