I'm trying to use Java 8's parallelStream() to execute several long-running requests (eg web requests) in parallel. Simplified example:
List<Supplier<Result>> myFunctions = Arrays.asList(() -> doWebRequest(), ...)
List<Result> results = myFunctions.parallelStream().map(function -> function.get()).collect(...
So if there are two functions that block for 2 and 3 seconds respectively, I'd expect to get the result after 3 seconds. However, it really takes 5 seconds - ie it seems the functions are being executed in sequence and not in parallel. Am I doing something wrong?
edit: This is an example. The time taken is ~4000 milliseconds when I want it to be ~2000.
long start = System.currentTimeMillis();
Map<String, Supplier<String>> input = new HashMap<String, Supplier<String>>();
input.put("1", () -> {
try {
Thread.sleep(2000);
} catch (InterruptedException e) {
e.printStackTrace();
}
return "a";
});
input.put("2", () -> {
try {
Thread.sleep(2000);
} catch (InterruptedException e) {
e.printStackTrace();
}
return "b";
});
Map<String, String> results = input.keySet().parallelStream().collect(Collectors.toConcurrentMap(
key -> key,
key -> {
return input.get(key).get();
}));
System.out.println("Time: " + (System.currentTimeMillis() - start));
}
Doesn't make any difference if I iterate over the entrySet() instead of the keySet()
edit: changing the parallel part to the following also does not help:
Map<String, String> results = input.entrySet().parallelStream().map(entry -> {
return new ImmutablePair<String, String>(entry.getKey(), entry.getValue().get());
}).collect(Collectors.toConcurrentMap(Pair::getLeft, Pair::getRight));
When executing in parallel, there is overhead of decomposing the input set, creating tasks to represent the different portions of the calculation, distributing the actions across threads, waiting for results, combining results, etc. This is over and above the work of actually solving the problem. If a parallel framework were to always decompose problems down to a granularity of one element, for most problems, these overheads would overwhelm the actual computation and parallelism would result in a slower execution. So parallel frameworks have some latitude to decide how finely to decompose the input, and that's what's happening here.
In your case, your input set is simply too small to be decomposed. So the library chooses to execute sequentially.
Try this on your four-core system: compare
vs
Here, you're giving it enough input that it will be confident it can win through parallel execution. If you measure with a responsible measurement methodology (say, the JMH microbenchmark harness), you'll probably see an almost-linear speedup between these two examples.