I have a piece of code where it appears, in every test I've run, that function calls have a significant amount of overhead. The code is a tight loop, performing a very simple function on each element of an array (containing 4-8 million int
s).
Running the code, which consists primarily of
for (int y = 1; y < h; ++y) {
for (int x = 1; x < w; ++x) {
final int p = y * s + x;
n[p] = f.apply(d, s, x, y);
}
}
executing something like
(final int[] d, final int s, final int x, final int y) -> {
final int p = s * y + x;
final int a = d[p] * 2
+ d[p - 1]
+ d[p + 1]
+ d[p - s]
+ d[p + s];
return (1000 * (a + 500)) / 6000;
};
on various machines (my work laptop, a W530 with i7 3840QM, a server VM with one core of a Xeon E5-1620, and a Digital Ocean node with one core of an unknown CPU), I repeatedly get a statistically significant performance hit when calling a method vs inlining. All tests were performed on Java 1.8.0_11 (Java HotSpot(TM) 64-Bit Server VM).
Work machine:
Benchmark Mode Samples Score Score error Units
c.s.q.ShaderBench.testProcessInline thrpt 200 40.860 0.184 ops/s
c.s.q.ShaderBench.testProcessLambda thrpt 200 22.603 0.159 ops/s
c.s.q.ShaderBench.testProcessProc thrpt 200 22.792 0.117 ops/s
Dedicated server, VM:
Benchmark Mode Samples Score Score error Units
c.s.q.ShaderBench.testProcessInline thrpt 200 40.685 0.224 ops/s
c.s.q.ShaderBench.testProcessLambda thrpt 200 16.077 0.113 ops/s
c.s.q.ShaderBench.testProcessProc thrpt 200 23.827 0.088 ops/s
DO VPS:
Benchmark Mode Samples Score Score error Units
c.s.q.ShaderBench.testProcessInline thrpt 200 24.425 0.506 ops/s
c.s.q.ShaderBench.testProcessLambda thrpt 200 9.643 0.140 ops/s
c.s.q.ShaderBench.testProcessProc thrpt 200 13.733 0.134 ops/s
All acceptable performance, but I am interested in figuring out why the call has such significant overhead and what can be done to optimize that. Currently experimenting with different sets of parameters.
Inlining all the potential operations would be difficult, but theoretically possible. For close to a 2x performance increase, potentially worth it, but maintenance would be a nightmare.
I'm not sure if there's a reasonable way to batch up a set of repetitions; most of the operations take multiple inputs (unknown to the caller) and produce a single output.
What other options do I have for minimizing the overhead and evening out performance?