I have a piece of code where it appears, in every test I've run, that function calls have a significant amount of overhead. The code is a tight loop, performing a very simple function on each element of an array (containing 4-8 million int
s).
Running the code, which consists primarily of
for (int y = 1; y < h; ++y) {
for (int x = 1; x < w; ++x) {
final int p = y * s + x;
n[p] = f.apply(d, s, x, y);
}
}
executing something like
(final int[] d, final int s, final int x, final int y) -> {
final int p = s * y + x;
final int a = d[p] * 2
+ d[p - 1]
+ d[p + 1]
+ d[p - s]
+ d[p + s];
return (1000 * (a + 500)) / 6000;
};
on various machines (my work laptop, a W530 with i7 3840QM, a server VM with one core of a Xeon E5-1620, and a Digital Ocean node with one core of an unknown CPU), I repeatedly get a statistically significant performance hit when calling a method vs inlining. All tests were performed on Java 1.8.0_11 (Java HotSpot(TM) 64-Bit Server VM).
Work machine:
Benchmark Mode Samples Score Score error Units
c.s.q.ShaderBench.testProcessInline thrpt 200 40.860 0.184 ops/s
c.s.q.ShaderBench.testProcessLambda thrpt 200 22.603 0.159 ops/s
c.s.q.ShaderBench.testProcessProc thrpt 200 22.792 0.117 ops/s
Dedicated server, VM:
Benchmark Mode Samples Score Score error Units
c.s.q.ShaderBench.testProcessInline thrpt 200 40.685 0.224 ops/s
c.s.q.ShaderBench.testProcessLambda thrpt 200 16.077 0.113 ops/s
c.s.q.ShaderBench.testProcessProc thrpt 200 23.827 0.088 ops/s
DO VPS:
Benchmark Mode Samples Score Score error Units
c.s.q.ShaderBench.testProcessInline thrpt 200 24.425 0.506 ops/s
c.s.q.ShaderBench.testProcessLambda thrpt 200 9.643 0.140 ops/s
c.s.q.ShaderBench.testProcessProc thrpt 200 13.733 0.134 ops/s
All acceptable performance, but I am interested in figuring out why the call has such significant overhead and what can be done to optimize that. Currently experimenting with different sets of parameters.
Inlining all the potential operations would be difficult, but theoretically possible. For close to a 2x performance increase, potentially worth it, but maintenance would be a nightmare.
I'm not sure if there's a reasonable way to batch up a set of repetitions; most of the operations take multiple inputs (unknown to the caller) and produce a single output.
What other options do I have for minimizing the overhead and evening out performance?
A method call is not a problem since hot methods are often inlined. A virtual call is an issue.
In your code the type profiler is fooled by the initialization method
Image.random
. WhenImage.process
is JIT-compiled for the first time, it is optimized for callingrandom.nextInt()
. So the next invocations ofImage.process
will result in the inline-cache miss followed by a non-optimized virtual call toShader.apply
.Remove an
Image.process
call from the initialization method and JIT will then inline the useful calls toShader.apply
.After
BlurShader.apply
is inlined you can help JIT to perform Common subexpression elimination optimization by replacingwith
The latter expression is also met in
Image.process
, so JIT will not calculate the same expression twice.After applying these two changes I've achieved the ideal benchmark score: