I was recently profiling my code and found one interesting bottleneck in it. Here is benchmark :
@BenchmarkMode(Mode.Throughput)
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
public class Contains {
private int[] ar = new int[] {1,2,3,4,5,6,7};
private int val = 5;
@Benchmark
public boolean naive() {
return contains(ar, val);
}
@Benchmark
public boolean lambdaArrayStreamContains() {
return Arrays.stream(ar).anyMatch(i -> i == val);
}
@Benchmark
public boolean lambdaIntStreamContains() {
return IntStream.of(ar).anyMatch(i -> i == val);
}
private static boolean contains(int[] ar, int value) {
for (int arVal : ar) {
if (arVal == value) {
return true;
}
}
return false;
}
}
Result :
Benchmark Mode Cnt Score Error Units
Contains.lambdaArrayStreamContains thrpt 10 22867.962 ± 1049.649 ops/s
Contains.lambdaIntStreamContains thrpt 10 22983.800 ± 593.580 ops/s
Contains.naive thrpt 10 228002.406 ± 8591.186 ops/s
If shows that Array contains operation via lambda is 10 times slower than naive implementation with simple loop. I knew that lambdas should be a bit slower. But 10 times? Am I doing wrong lambda or this is some issue with java?
Your benchmark does not actually measure
anyMatch
performance, but rather a stream overhead. This overhead can appear significant when comparing to a very simple operation like a five-element array lookup.The slowdown will not look so terrible if we swtich from relative to absolute numbers. Let's measure latency instead of throughput for a clearer picture. I've omitted
lambdaIntStream
benchmark since it works exactly the same way aslambdaArrayStream
.5.8 ns is roughly 14 cycles of a 2.4 GHz CPU. The workload is so small that any extra cycle will be noticeable. So what is the overhead of stream operations?
Object allocation
Now rerun the benchmark with
-prof gc
profiler. It will show the amount of heap allocations:lambdaArrayStream
allocates 152 bytes per iteration whilenaive
benchmark allocates nothing. Of course, allocation is not free: there are at least 5 objects constructed to supportanyMatch
, and each takes several nanoseconds:i -> i == val
Call stack
java.util.stream
implementation is a bit complicated since it must support all combinations of stream sources, intermediate and terminal operations. If you look at the call stack ofanyMatch
in your benchmark, you'll see something like that:Not all of these method calls can be inlined. Furthermore, JVM limits inlining to 9 levels, but here we see the deeper call stack. If we override the limit with
-XX:MaxInlineLevel=20
the score will become a bit better:Loop optimizations
for
iteration over an array is a trivial counted loop. JVM can apply a wide range of loop optimizatons here: loop peeling, loop unrolling etc. This does not work for awhile
-kind loop inforEachWithCancel
method, which is used to traverse an IntStream. The effect of loop optimizations can be measured with-XX:LoopUnrollLimit=0 -XX:-UseLoopPredicate
:Conclusions
There is some overhead to construct and to traverse a stream, but this is completely understood and cannot be considered a bug. I would not say the overhead is big (even 50 ns/op is not that much); however, in this particular example the overhead is dominant because of an extremely tiny workload.