I am trying to instrument java byte code.
I want to recognize the entry and exit of a java loop, but I have found the identification of loops to be quite challenging. I have spent a good few hours looking at ASM and open source de-compilers (whom I thought must solve this problem all the time), however, I came up short.
The tool I am augmenting / extending, is using ASM, so ideally I would like to know how to instrument the entry and exit of the different loop constructs in java via ASM. However, I would also welcome a recommendation for a good open source de-compiler, as clearly they would have solved the same problem.
EDIT 4: A bit of background/preamble.
"The only way to jump backward in the code is via a loop." in Peter's answer isn't strictly true. You could jump back and forth without it meaning it's a loop. A simplified case would be something like this:
Of course, this particular example is very artificial and a bit silly. However, making assumptions as to how the source-to-bytecode compiler is going to behave could lead to surprises. As Peter and I have shown in our respective answers, two popular compilers can produce a rather different output (even without obfuscation). It rarely matters, because all of this tends to be optimised rather well by the JIT compiler when you execute the code. This being said, in the vast majority of cases, jumping backwards will be a reasonable indication as to where a loop starts. Compared with the rest, finding out the entry point of a loop is the "easy" part.
Before considering any loop start/exit instrumentation, you should look into the definitions of what entry, exit and successors are. Although a loop will only have one entry point, it may have multiple exit points and/or multiple successors, typically caused by
break
statements (sometimes with labels),return
statements and/or exceptions (explicitly caught or not). While you haven't given details regarding the kind of instrumentations you're investigating, it's certainly worth considering where you want to insert code (if that's what you want to do). Typically, some instrumentation may have to be done before each exit statement or instead of each successor statement (in which case you'll have to move the original statement).Soot is a good framework to do this. It has a number of intermediate representations that make bytecode analysis more convenient (e.g. Jimple).
You can build a BlockGraph based on your method body, for example an ExceptionalBlockGraph. Once you've decomposed the control flow graph into such a block graph, from the nodes, you should be able to identity the dominators (i.e. blocks that have an arrow coming back to them). This will give you the start of the loop.
You may find something similar done in sections 4.3 to 4.7 of this dissertation.
EDIT:
Following the discussion with @Peter in comments to his answer. Talking the same example:
This time, compiled with the Eclipse compiler (no specific option: simply autocompilation from within the IDE). This code hasn't been obfuscated (apart from being bad code, but that's a different matter). Here is the result (from
javap -c
):There is a loop between 3 and 12 (jumped in starting a 10) and another loop, due to the exception occurring from the division by zero at 8 to 22. Unlike the
javac
compiler result, where one could make as guess that there was an outer loop between 0 and 22 and an inner loop between 0 and 12, the nesting is less obvious here.EDIT 2:
To illustrate the kind of problems you may get with a less awkward example. Here is a relatively simple loop:
After (normal) compilation within Eclipse,
javap -c
gives this:Before doing anything within the loop, you jump straight from 2 to 15. Block 15 to 17 is the header of the loop (the "entry point"). Sometimes, the header block could contain far more instructions, especially if the exit condition involves more evaluation, or if it's a
do {} while()
loop. The concept of "entry" and "exit" of a loop may not always reflect what you'd write sensibly as Java source code (including the fact that you can rewritefor
loops aswhile
loops, for example). Usingbreak
can also lead to multiple exit points.By the way, by "block", I mean a sequence of bytecode into which you can't jump and out of which you can't jump in the middle: they're only entered from the first line (not necessarily from the previous line, possibly from a jump from somewhere else) and exited from the last (not necessarily to the following line, it can jump somewhere else too).
EDIT 3:
It seems that new classes/methods to analyse loops have been added since last time I had looked at Soot, which make it a bit more convenient.
Here is a complete example.
The class/method to analyse (
TestLoop.foo()
)When compiled by the Eclipse compiler, this produces this bytecode (
javap -c
):Here is a program that loads the class (assuming it's on the classpath here) using Soot and displays its blocks and loops:
Check the Soot documentation for more details on how to load classes. The
Body
is a model for the body of the loop, i.e. all the statements made from the bytecode. This uses the intermediate Jimple representation, which is equivalent to the bytecode, but easier to analyse and process.Here is the output of this program:
Body:
Blocks:
Loops:
LoopNestTree
usesLoopFinder
, which uses anExceptionalBlockGraph
to build the list of blocks. TheLoop
class will give you the entry statement and the exit statements. You should then be able to add extra statements if you wish. Jimple is quite convenient for this (it's close enough to the bytecode, but has a slightly higher level so as not to deal with everything manually). You can then output your modified.class
file if needed. (See the Soot documentation for this.)Are you actually building your class byte by byte? Thats pretty wild! The front page of ASM links to the Bytecode Outline plugin for Eclipse, which I assume you are using. If you click on the first image on there you will notice the code has a while loop, and you can see at least some of the byte code used to implement that loop. For reference here is that screenshot:
Direct link
Looks like loops are simply implemented as GOTO's with a boundary check. I'm talking about this line:
I'm sure the L3 marker has code for checking the index bound and decided wether to JMP. I think this is going to be quite hard for you if you want to instrument loops one byte code at a time. ASM does have the option of using a template class as the basis for you instrumentation, have you tried using it?
The only way to jump backward in the code is via a loop. So you are looking for a goto,if_icmplt etc which goes to a previous byte code instruction. Once you have found the end of the loop and where it jumps back to is the start of the loop.
Here is a complex example, from the document Bruno suggested.
The byte-code for this appears in
javap -c
asYou can see there is an inner loop between 0 and 12, a try/catch block between 0 and 15 and an outer loop between 0 and 22.