Why is it so easy to decompile .NET IL-code into source code, compared to decompiling native x86 binaries? (Reflector produces quite good source code most of the time, while decompiling the output of a C++ compiler is almost impossible.)
Is it because IL contains a lot of metadata? Or is it because IL is a higher-level abstraction than x86 instructions? I did some research and found the following two useful articles, but neither of them answers my question.
- MSIL Decompiler Theory
- C Decompiler - Quick primer
I think you've got the most important bits already.
- As you say, there's more metadata available. I don't know the details of what is emitted by a C or C++ compiler, but I suspect far more names and similar information are included in IL. Just look at what the decompiler knows about what's in a particular stack frame, for example: as far as x86 is concerned, you only know how the stack is used; in IL you know what the contents of the stack represent (or at least the type, though not the semantic meaning!).
- Again, as you've already mentioned, IL is a higher level abstraction than x86. x86 has no idea what a method or function call is, or an event, or a property etc. IL has all that information still within it.
- Typically C and C++ compilers optimise much more heavily than (say) the C# compiler. This is because the C# compiler assumes that most of the optimisation can still be performed later - by the JIT. In some ways it makes sense for the C# compiler not to try to do much optimisation, as there are various bits of information which are available to the JIT but not the C# compiler. Optimised code is harder to decompile, because it's further away from being a natural representation of the original source code.
- IL was designed to be JIT-compiled; x86 was designed to be executed natively (admittedly via microcode). The information the JIT compiler needs is similar to what a decompiler would want, so a decompiler has an easier time with IL. In some ways this is really just a restatement of the second point.
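To make the metadata point concrete, here is a minimal sketch that uses Python bytecode as a stand-in for IL. This is only an analogy: CPython bytecode is not .NET IL, but it is likewise a stack-based format that ships names alongside the instructions. The function `total_price` is a made-up example.

```python
import dis

def total_price(unit_price, quantity):
    subtotal = unit_price * quantity
    return round(subtotal, 2)

# The compiled code object still carries the names a decompiler would want.
code = total_price.__code__
print(code.co_name)      # name of the function
print(code.co_varnames)  # parameter names followed by local variable names
print(code.co_names)     # global names referenced by the body

# The disassembly shows stack-machine instructions annotated with those names.
dis.dis(total_price)
```

A stripped native binary built from the equivalent C++ would normally keep none of these identifiers, which is one reason the result of decompiling it is so much harder to read.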
There are a number of things that make reverse engineering IL fairly easy.
- Type information. This is massive. In x86 assembler, you have to infer the types of variables based on how they are used.
- Structure. Information on the structure of the application is more available in IL disassemblies. This, combined with type information, gives you an amazing amount of data. You're working at a pretty high level at this point (relative to x86 assembler). In native assembler, you have to infer the structure layouts (and even the fact that they are structures) based on how the data is used. Not impossible, but much more time-consuming.
- Names. Knowing the names of things can be useful.
These things, combined, mean you have quite a lot of data about the executable. IL is basically working at a level much closer to the source than a compiler of native code would be. The higher the level the bytecode works at, the easier reverse engineering is, generally speaking.
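As a small illustration of why missing type information hurts, the sketch below reinterprets the same four bytes in several ways using Python's standard `struct` module. The variable names are invented for the example; the point is only that raw memory does not say which interpretation the original program intended.

```python
import struct

# Four bytes as they might sit in memory in a native process (little-endian).
raw = struct.pack("<f", 1.0)

as_float = struct.unpack("<f", raw)[0]   # read as one 32-bit float
as_uint = struct.unpack("<I", raw)[0]    # read as one 32-bit unsigned integer
as_shorts = struct.unpack("<hh", raw)    # read as two 16-bit signed integers

print(as_float)   # 1.0
print(as_uint)    # 1065353216
print(as_shorts)  # (0, 16256)
```

With native code, a decompiler has to guess among readings like these from the instructions that touch the data. In IL, the declared types of fields, parameters, and locals are recorded, so no guessing is needed.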
C# and IL nearly map one-to-one. (This is less so with some newer C# 3.0 features.) The closeness of the mapping (and the lack of an optimizer in the C# compiler) makes things so 'reversible'.
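A rough sketch of why a close mapping makes things reversible: when the instructions are unoptimised stack operations that carry names, you can rebuild an expression just by simulating the stack symbolically. The instruction set below is a hypothetical toy, loosely inspired by IL mnemonics; it is not real IL, and real decompilers do far more than this.

```python
def decompile(instructions):
    """Rebuild a source-like statement by simulating the evaluation stack symbolically."""
    binary_ops = {"add": "+", "sub": "-", "mul": "*"}
    stack = []
    for op, *args in instructions:
        if op == "ldarg":            # push a named argument
            stack.append(args[0])
        elif op == "ldc":            # push a constant
            stack.append(str(args[0]))
        elif op in binary_ops:       # pop two operands, push the combined expression
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left} {binary_ops[op]} {right})")
        elif op == "call":           # pop argc operands, push a call expression
            name, argc = args
            call_args = [stack.pop() for _ in range(argc)][::-1]
            stack.append(f"{name}({', '.join(call_args)})")
        elif op == "ret":
            return f"return {stack.pop()};"
        else:
            raise ValueError(f"unknown opcode: {op}")
    raise ValueError("missing ret")

# Hypothetical toy program, not actual compiler output.
program = [
    ("ldarg", "price"),
    ("ldarg", "quantity"),
    ("mul",),
    ("ldc", 2),
    ("call", "Math.Round", 2),
    ("ret",),
]
print(decompile(program))  # return Math.Round((price * quantity), 2);
```

The same trick fails on optimised native code, where values live in reused registers, calls may be inlined, and no names survive.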
Extending Brian's correct answer
If you think all IL is easily decompilable, I suggest writing a non-trivial F# program and attempting to decompile that code. F# does a lot of code transformations and hence has a very poor mapping between the emitted IL and the original code base. IMHO, it is significantly more difficult to look at decompiled F# code and get back the original program than it is for C# or VB.NET.