Why can’t I observe the same performance improvement on the 32-bit JVM for the optimized primes algorithm?

Posted 2019-05-06 14:42

Question:

I was testing two different approaches (primes() and primesOpt()) to collect the first N prime numbers using the Java 8 IntStream. I took these examples from chapter 6 of Java 8 in Action. You can get the source code from this gist (Primes.java) and this pom.xml to build it with Maven and JMH integration. (Copy pom.xml to the project folder and Primes.java to src\main\java\primes, build it with the command mvn clean install, and then run the benchmark with java -jar target\benchmarks.jar.)

The first example (the primes() method) is a straightforward algorithm that collects N prime numbers into a List<Integer>. The second one (the primesOpt() method) is an enhanced approach that only tests divisibility by the previously found prime numbers.

I benchmark both implementations with JMH, collecting the prime numbers up to a maximum of 10,000 into a List<Integer>:

@Benchmark
public int testPrimes() {
    return primes(10_000).size();
}

@Benchmark    
public int testPrimesOpt() {
    return primesOpt(10_000).size();
}

I got different speedups depending on the JVM architecture. On the 64-bit JVM I observe a speedup of about 25% for primesOpt() over the standard version primes(), whereas on the 32-bit JVM there is no speedup at all.

Results for JRE 1.8.0_91-b14 64-Bit:

Benchmark              Mode  Cnt    Score    Error  Units
Primes.testPrimes     thrpt   50  269,278 ± 15,922  ops/s
Primes.testPrimesOpt  thrpt   50  341,861 ± 25,413  ops/s

Results for JRE 1.8.0_91-b14 32-Bit:

Benchmark              Mode  Cnt    Score   Error  Units
Primes.testPrimes     thrpt  200  105,388 ± 2,741  ops/s
Primes.testPrimesOpt  thrpt  200  103,015 ± 2,035  ops/s

These tests were performed on a machine with a dual-core Intel i7 CPU with hyper-threading, i.e. 2 cores and 4 hardware threads, and 4GB of RAM. The JVM version was 1.8.0_91-b14, running on Windows 7. The benchmark was executed with a minimum heap size of 1024MB (corresponding to -Xms1024M). No other activities were running during the measurements.
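
For reference, here is a minimal sketch of how such a setup can be expressed with JMH annotations; the class name PrimesBenchmark and the use of jvmArgsAppend for the heap setting are illustrative assumptions, not taken from the gist:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Illustrative sketch: throughput benchmark, each fork started with -Xms1024M.
// PrimesBenchmark is a hypothetical class name; the gist may organize this differently.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(jvmArgsAppend = "-Xms1024M")
public class PrimesBenchmark {

    @Benchmark
    public int testPrimes() {
        return Primes.primes(10_000).size();
    }

    @Benchmark
    public int testPrimesOpt() {
        return Primes.primesOpt(10_000).size();
    }
}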

Do you have any idea why I can’t observe the same performance improvement for the optimized version of the primes algorithm on the 32-bit JVM?

primes() method implementation:

public static boolean isPrime(int n) {
    int root = (int) Math.sqrt(n);
    return IntStream
        .rangeClosed(2, root)
        .noneMatch(div -> n%div == 0);
}
public static List<Integer> primes(int max) {
    return IntStream
        .range(2, max)
        .filter(Primes::isPrime)
        .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
}

primesOpt() method implementation:

public static boolean isPrimeOpt(List<Integer> primes, int n) {
    int root = (int) Math.sqrt(n);
    return takeWhile(primes, root)                    // only the primes up to sqrt(n) are candidates
        .stream()
        .noneMatch(div -> n % div == 0);
}

public static List<Integer> takeWhile(List<Integer> src, int max) {
    int i;
    for(i = 0; i < src.size() && src.get(i) <= max; i++) {}   // advance i past the elements <= max
    return src.subList(0, i);
}

public static List<Integer> primesOpt(int max) {
    ArrayList<Integer> res = new ArrayList<>();       // holds the primes found so far
    return IntStream
        .range(2, max)
        .filter(n -> Primes.isPrimeOpt(res, n))       // test n only against the previously found primes
        .collect(() -> res, ArrayList::add, (l1, l2) -> {});   // reuses res as the result container
}
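
As a quick, purely illustrative sanity check (assuming both methods live in a class named Primes and the streams run sequentially), the two variants should yield the same list:

System.out.println(Primes.primes(30));     // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
System.out.println(Primes.primesOpt(30));  // expected to produce the same primes when run sequentially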

Answer 1:

I can’t reproduce your results, but generally, performance can differ significantly depending on environmental factors. In your code, the takeWhile approach forces processing of boxed Integer values, whereas the non-optimized variant isPrime processes int values only.

That trade-off should pay off the more primes you request; i.e. if scanning the first 10_000 numbers shows inconclusive results, try 100_000 or 1_000_000. The boxing overhead is linear at worst, and a decent JVM might turn it into a sub-linear or even constant overhead, whereas the improvement from limiting the divisions to the actual primes should be better than linear, since the density of primes decreases for higher numbers.
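
To make the boxing point concrete, here is a sketch (not part of the original code) of the same optimization with the already-found primes kept in an int[] instead of a List<Integer>; the method name and the generous array sizing are illustrative choices:

// Sketch only: the same "divide by previous primes" idea without boxed Integer values.
// Requires java.util.Arrays; primesOptUnboxed is a hypothetical name.
public static int[] primesOptUnboxed(int max) {
    int[] primes = new int[max];              // generous upper bound on the number of primes below max
    int count = 0;
    for(int n = 2; n < max; n++) {
        int root = (int) Math.sqrt(n);
        boolean isPrime = true;
        for(int i = 0; i < count && primes[i] <= root; i++) {
            if(n % primes[i] == 0) { isPrime = false; break; }
        }
        if(isPrime) primes[count++] = n;
    }
    return Arrays.copyOf(primes, count);      // trim to the primes actually found
}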

So perhaps the 64-bit JVM you have used has a higher overhead when processing boxed values, but I assume that testing with a higher max will also reveal an advantage for your optimized variant, unless that JVM knows a trick to reduce the cost of division operations significantly.


But it shouldn’t be ignored that your optimized variant is broken in several ways. You are passing the supplier () -> res to collect, which violates the contract because it doesn’t return a new container on each evaluation, and there is interference between the Collector and the Predicate used in the preceding filter step.
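
One contract-respecting way to keep the "only divide by previous primes" idea is sketched below, under the assumption that you give up the collect trick entirely; primesOptFixed is an illustrative name, not part of the original code:

// Sketch: same optimization, but the result list is no longer shared between
// the filter predicate and a Collector. primesOptFixed is a hypothetical name.
public static List<Integer> primesOptFixed(int max) {
    List<Integer> res = new ArrayList<>();
    for(int n = 2; n < max; n++) {
        if(isPrimeOpt(res, n)) {      // res already contains every prime smaller than n
            res.add(n);
        }
    }
    return res;
}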

This suggests that trying to optimize your Stream-based solution perhaps doesn’t lead anywhere. Compare it with the following straightforward approach:

// Sieve over odd numbers only: bit i of the BitSet represents the odd number 2*i + 1.
public static List<Integer> primesBest(int max) {
    BitSet prime = new BitSet();
    prime.set(1, max >> 1);                           // mark the odd numbers from 3 up to max as candidates
    for(int i = 3; i < max; i += 2)
        if(prime.get((i - 1) >> 1))                   // i is still marked as prime
            for(int b = i * 3; b < max; b += i * 2)
                prime.clear((b - 1) >> 1);            // clear the odd multiples of i
    return IntStream.concat(IntStream.of(2),          // prepend 2, the only even prime
        prime.stream().map(i -> i + i + 1)).boxed().collect(Collectors.toList());
}

It avoids all division and boxing operations, at the “disadvantage” of not using Stream operations for the value selection, but only for the creation of the final List<Integer>. On my machine, it is about 10 times faster than your optimized variant for 10_000 elements and 50 times faster for 1_000_000 elements. That suggests that performance differences on the order of 10%, 20%, or even a factor of two or three are not worth much discussion.

I don’t see how this algorithm can be expressed using the Stream API though. The bottom line might be that not every operation benefits from the Stream API.
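
If you want to reproduce such a comparison across problem sizes yourself, a JMH parameter is a convenient way to do it; the following is only a sketch, and the state class, its name, and the chosen values are illustrative:

// Illustrative sketch: run a benchmark for several values of max via a JMH @Param.
@State(Scope.Benchmark)
public static class Sizes {
    @Param({"10000", "100000", "1000000"})
    public int max;
}

@Benchmark
public int testPrimesBest(Sizes sizes) {
    return primesBest(sizes.max).size();
}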