Minimizing Java function call overhead

I have a piece of code where it appears, in every test I've run, that function calls have a significant amount of overhead. The code is a tight loop, performing a very simple function on each element of an array (containing 4-8 million ints).

Running the code, which consists primarily of

for (int y = 1; y < h; ++y) {
    for (int x = 1; x < w; ++x) {
        final int p = y * s + x;
        n[p] = f.apply(d, s, x, y);
    }
}

executing something like

(final int[] d, final int s, final int x, final int y) -> {
    final int p = s * y + x;
    final int a = d[p] * 2
                + d[p - 1]
                + d[p + 1]
                + d[p - s]
                + d[p + s];
    return (1000 * (a + 500)) / 6000;
};

on various machines (my work laptop, a W530 with i7 3840QM, a server VM with one core of a Xeon E5-1620, and a Digital Ocean node with one core of an unknown CPU), I repeatedly get a statistically significant performance hit when calling a method vs inlining. All tests were performed on Java 1.8.0_11 (Java HotSpot(TM) 64-Bit Server VM).

Work machine:

Benchmark                               Mode   Samples        Score  Score error    Units
c.s.q.ShaderBench.testProcessInline    thrpt       200       40.860        0.184    ops/s
c.s.q.ShaderBench.testProcessLambda    thrpt       200       22.603        0.159    ops/s
c.s.q.ShaderBench.testProcessProc      thrpt       200       22.792        0.117    ops/s

Dedicated server, VM:

Benchmark                               Mode   Samples        Score  Score error    Units
c.s.q.ShaderBench.testProcessInline    thrpt       200       40.685        0.224    ops/s
c.s.q.ShaderBench.testProcessLambda    thrpt       200       16.077        0.113    ops/s
c.s.q.ShaderBench.testProcessProc      thrpt       200       23.827        0.088    ops/s

DO VPS:

Benchmark                               Mode   Samples        Score  Score error    Units
c.s.q.ShaderBench.testProcessInline    thrpt       200       24.425        0.506    ops/s
c.s.q.ShaderBench.testProcessLambda    thrpt       200        9.643        0.140    ops/s
c.s.q.ShaderBench.testProcessProc      thrpt       200       13.733        0.134    ops/s

All acceptable performance, but I am interested in figuring out why the call has such significant overhead and what can be done to optimize that. Currently experimenting with different sets of parameters.

Inlining all the potential operations would be difficult, but theoretically possible. For close to a 2x performance increase, potentially worth it, but maintenance would be a nightmare.

I'm not sure if there's a reasonable way to batch up a set of repetitions; most of the operations take multiple inputs (unknown to the caller) and produce a single output.

What other options do I have for minimizing the overhead and evening out performance?

A method call is not a problem since hot methods are often inlined. A virtual call is an issue.

In your code the type profiler is fooled by the initialization method Image.random. When Image.process is JIT-compiled for the first time, it is optimized for calling random.nextInt(). So the next invocations of Image.process will result in the inline-cache miss followed by a non-optimized virtual call to Shader.apply.

Remove an Image.process call from the initialization method and JIT will then inline the useful calls to Shader.apply.
After BlurShader.apply is inlined you can help JIT to perform Common subexpression elimination optimization by replacing
```
final int p = s * y + x;
```
with
```
final int p = y * s + x;
```
The latter expression is also met in Image.process, so JIT will not calculate the same expression twice.

After applying these two changes I've achieved the ideal benchmark score:

Benchmark                           Mode   Samples         Mean   Mean error    Units
s.ShaderBench.testProcessInline    thrpt         5       36,483        1,255    ops/s
s.ShaderBench.testProcessLambda    thrpt         5       36,323        0,936    ops/s
s.ShaderBench.testProcessProc      thrpt         5       36,163        1,421    ops/s