Here is a sample code:
public class TestIO{
public static void main(String[] str){
TestIO t = new TestIO();
t.fOne();
t.fTwo();
t.fOne();
t.fTwo();
}
public void fOne(){
long t1, t2;
t1 = System.nanoTime();
int i = 10;
int j = 10;
int k = j*i;
System.out.println(k);
t2 = System.nanoTime();
System.out.println("Time taken by 'fOne' ... " + (t2-t1));
}
public void fTwo(){
long t1, t2;
t1 = System.nanoTime();
int i = 10;
int j = 10;
int k = j*i;
System.out.println(k);
t2 = System.nanoTime();
System.out.println("Time taken by 'fTwo' ... " + (t2-t1));
}
}
This gives the following output:
100
Time taken by 'fOne' ... 390273
100
Time taken by 'fTwo' ... 118451
100
Time taken by 'fOne' ... 53359
100
Time taken by 'fTwo' ... 115936
Press any key to continue . . .
Why does it take more time (significantly more) to execute the same method for the first time than the consecutive calls?
I tried giving -XX:CompileThreshold=1000000
to the command line, but there was no difference.
There are several reasons. The JIT (Just In Time) compiler may not have run. The JVM can do optimizations that differ between invocations. You're measuring elapsed time, so maybe something other than Java is running on your machine. The processor and RAM caches are probably "warm" on subsequent invocations.
You really need to make multiple invocations (in the thousands) to get an accurate per method execution time.
The issues mentioned by Andreas and the unpredictability of JIT are true, but still one more issue is the class loader:
The first call to fOne
differs radically from the latter ones, because that is what makes the first call to System.out.println
, which means that is when the class loader will from disk or file system cache (usually it's cached) all the classes that are needed to print the text. Give the parameter -verbose:class
to the JVM to see how many classes are actually loaded during this small program.
I've noticed similar behavior when running unit tests - the first test to call a big framework takes much longer (in case of Guice about 250ms on a C2Q6600), even though the test code would be the same, because the first invocation is when hundreds of classes are loaded by the class loader.
Since your example program is so short, the overhead probably comes from the very early JIT optimizations and the class loading activity. The garbage collector will probably not even start before the program has ended.
Update:
Now I found a reliable way to find out what is really taking the time. Nobody had yet found out it, although it's closely related to class loading - it was dynamic linking of native methods!
I modified the code as follows, so that the logs would show when the tests begin and end (by looking when those empty marker classes are loaded).
TestIO t = new TestIO();
new TestMarker1();
t.fOne();
t.fTwo();
t.fOne();
t.fTwo();
new TestMarker2();
The command for running the program, with the right JVM parameters that show what is really happening:
java -verbose:class -verbose:jni -verbose:gc -XX:+PrintCompilation TestIO
And the output:
* snip 493 lines *
[Loaded java.security.Principal from shared objects file]
[Loaded java.security.cert.Certificate from shared objects file]
[Dynamic-linking native method java.lang.ClassLoader.defineClass1 ... JNI]
[Loaded TestIO from file:/D:/DEVEL/Test/classes/]
3 java.lang.String::indexOf (166 bytes)
[Loaded TestMarker1 from file:/D:/DEVEL/Test/classes/]
[Dynamic-linking native method java.io.FileOutputStream.writeBytes ... JNI]
100
Time taken by 'fOne' ... 155354
100
Time taken by 'fTwo' ... 23684
100
Time taken by 'fOne' ... 22672
100
Time taken by 'fTwo' ... 23954
[Loaded TestMarker2 from file:/D:/DEVEL/Test/classes/]
[Loaded java.util.AbstractList$Itr from shared objects file]
[Loaded java.util.IdentityHashMap$KeySet from shared objects file]
* snip 7 lines *
And the reason for that time difference is this: [Dynamic-linking native method java.io.FileOutputStream.writeBytes ... JNI]
We can also see, that the JIT compiler is not affecting this benchmark. There are only three methods which are compiled (such as java.lang.String::indexOf
in the above snippet) and they all happen before the fOne
method is called.
In addition to JITting, other factors could be:
- The process's output stream blocking when you call System.out.println
- Your process getting scheduled out by another process
- The garbage collector doing some work on a background thread
If you want to get good benchmarks, you should
- Run the code you're benchmarking a large number of times, several thousand at least, and calculate the average time.
- Ignore the times of the first several calls (due to JITting, etc.)
- Disable the GC if you can; this may not be an option if your code generates a lot of objects.
- Take the logging (println calls) out of the code being benchmarked.
There are benchmarking libraries on several platforms that will help you do this stuff; they can also calculate standard deviations and other statistics.
The most likely culprit is the JIT (just-in-time) hotspot engine. Basically the first time code is executed the machine code is "remembered" by the JVM and then reused on future executions.
I think it is because the second time the generated code was already optimized, after the first run.
As has been suggested, JIT could be the culprit, but so could I/O wait time as well as resource wait time if other processes on the machine were using resources at that moment.
The moral of this story is that micrbenchmarking is a hard problem, especially for Java. I don't know why you're doing this, but if you're trying to choose between two approaches for a problem, don't measure them this way. Use the strategy design pattern and run your entire program with the two different approaches and measure the whole system. That makes little bumps in processing time even out over the long run, and gives you a much more realistic view of how much the performance of your entire app is bottlenecked at that point (hint: it's probably less than you think.)
Well the most probably answer is initialization. JIT is for sure not the right answer as it takes a lot more cycles before it starts to optimize. But at the very first time there can be:
- looking up classes (is cached so no second lookup needed)
- loading classes (once loaded stays in memory)
- getting additional code from the native library (native code is cached)
- finally it loads the code to be executed in the L1 cache of the CPU. That is the most propable case for speedup in your sense and at the same time a reason why the benchmark (being a microbenchmark) does not say much. If your code is small enough the second invocation of a loop can be run completely from inside the CPU which is fast. In the real world this does not happen because programs are bigger and the reusage of the L1 cache is far from being that big.