I have a list myListToParse whose elements I want to filter; I then want to apply a method to each remaining element and add the result to another list, myFinalList.
With Java 8 I noticed that I can do it in two different ways. I would like to know which one is more efficient, and why one is better than the other. I'm also open to suggestions for a third way.
Method 1:
myFinalList = new ArrayList<>();
myListToParse.stream()
.filter(elt -> elt != null)
.forEach(elt -> myFinalList.add(doSomething(elt)));
Method 2:
myFinalList = myListToParse.stream()
.filter(elt -> elt != null)
.map(elt -> doSomething(elt))
.collect(Collectors.toList());
Don't worry about performance differences; they're normally going to be minimal in a case like this.
Method 2 is preferable because:
- It doesn't require mutating a collection that exists outside the lambda expression.
- It's more readable: the different steps performed in the collection pipeline are written sequentially (first a filter operation, then a map operation, then collecting the result). (For more on the benefits of collection pipelines, see Martin Fowler's excellent article.)
- You can easily change the way values are collected by replacing the Collector that is used, as sketched below. In some cases you may need to write your own Collector, but then the benefit is that you can easily reuse it.
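For instance (a minimal sketch, assuming doSomething returns a String), swapping the Collector is all it takes to change how the results are gathered:
// Collect into a Set instead of a List by swapping the Collector:
Set<String> asSet = myListToParse.stream()
        .filter(elt -> elt != null)
        .map(elt -> doSomething(elt))
        .collect(Collectors.toSet());

// Or join the mapped values into a single String:
String joined = myListToParse.stream()
        .filter(elt -> elt != null)
        .map(elt -> doSomething(elt))
        .collect(Collectors.joining(", "));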
I agree with the existing answers that the second form is better because it does not have any side effects and is easier to parallelise (just use a parallel stream).
Performance-wise, they appear to be equivalent until you start using parallel streams. In that case, map performs much better. See the micro-benchmark results below:
Benchmark Mode Samples Score Error Units
SO28319064.forEach avgt 100 187.310 ± 1.768 ms/op
SO28319064.map avgt 100 189.180 ± 1.692 ms/op
SO28319064.mapWithParallelStream avgt 100 55.577 ± 0.782 ms/op
You can't boost the first example in the same manner because forEach is a terminal operation that returns void, so you are forced to use a stateful lambda. And that is a really bad idea if you are using parallel streams, as the sketch below illustrates.
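A minimal sketch of why (illustrative only; the exact failure varies from run to run): adding to a plain ArrayList from a parallel forEach is an unsynchronized mutation from multiple threads:
List<Integer> unsafe = new ArrayList<>();
IntStream.range(0, 100_000).boxed()
        .parallel()
        .forEach(unsafe::add); // stateful lambda: unsynchronized concurrent mutation

// Typically prints less than 100000, or the pipeline even dies with an
// ArrayIndexOutOfBoundsException, because ArrayList is not thread-safe.
System.out.println(unsafe.size());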
Finally, note that your second snippet can be written in a slightly more concise way with method references and static imports:
myFinalList = myListToParse.stream()
.filter(Objects::nonNull)
.map(this::doSomething)
.collect(toList());
One of the main benefits of using streams is the ability to process data in a declarative way, that is, using a functional style of programming. Streams also give you multi-threading capability for free, meaning there is no need to write any extra multi-threaded code to make your stream concurrent.
Assuming the reason you are exploring this style of programming is that you want to exploit these benefits, your first code sample is potentially not functional, since the forEach method is a terminal operation that is explicitly meant to produce side effects.
The second way is preferred from a functional programming point of view, since the map function accepts stateless lambda functions. More explicitly, the lambda passed to the map function should be:
- Non-interfering, meaning that the function should not alter the source of the stream if it is non-concurrent (e.g. an ArrayList), as the sketch below illustrates.
- Stateless, to avoid unexpected results when doing parallel processing (caused by thread scheduling differences).
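A minimal sketch of interference (illustrative only): mutating a non-concurrent source while the pipeline is running typically fails fast:
List<String> src = new ArrayList<>(Arrays.asList("a", "b", "c"));
src.stream()
        .map(s -> { src.add(s); return s; }) // interferes with the non-concurrent source
        .collect(Collectors.toList());       // typically throws ConcurrentModificationException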
Another benefit of the second approach is that if the stream is parallel and the collector is concurrent and unordered, these characteristics can provide useful hints that allow the reduction operation to do the collecting concurrently.
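For example (a sketch, assuming doSomething returns a String), a concurrent, unordered collector such as Collectors.groupingByConcurrent lets a parallel stream accumulate into one shared concurrent map instead of merging per-thread containers:
ConcurrentMap<Integer, List<String>> byLength = myListToParse.parallelStream()
        .filter(Objects::nonNull)
        .map(this::doSomething)
        .collect(Collectors.groupingByConcurrent(String::length));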
If you use Eclipse Collections you can use the collectIf() method.
MutableList<Integer> source =
Lists.mutable.with(1, null, 2, null, 3, null, 4, null, 5);
MutableList<String> result = source.collectIf(Objects::nonNull, String::valueOf);
Assert.assertEquals(Lists.immutable.with("1", "2", "3", "4", "5"), result);
It evaluates eagerly and should be a bit faster than using a Stream.
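If you prefer to keep the filtering and mapping steps separate, the equivalent select/collect calls should give the same result (a sketch, just less fused than collectIf):
MutableList<String> result2 = source
        .select(Objects::nonNull)
        .collect(String::valueOf);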
Note: I am a committer for Eclipse Collections.
I prefer the second way.
When you use the first way, if you decide to use a parallel stream to improve performance, you'll have no control over the order in which forEach adds the elements to the output list.
When you use toList, the Streams API preserves the encounter order even if you use a parallel stream.
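A minimal demonstration of the difference (illustrative; the forEach result varies between runs):
List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

// forEach on a parallel stream: no guarantee about the order of the side effects
List<Integer> viaForEach = Collections.synchronizedList(new ArrayList<>());
numbers.parallelStream().forEach(viaForEach::add); // order varies between runs

// collect(toList()) on a parallel stream: encounter order is preserved
List<Integer> viaCollect = numbers.parallelStream().collect(Collectors.toList());
// viaCollect is always [1, 2, ..., 10]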
There is a third option: using stream().toArray() - see the comments under "Why didn't Stream have a toList method?". It turns out to be slower than forEach() or collect(), and less expressive. It might be optimised in later JDK builds, so I'm adding it here just in case.
assuming List<String>
myFinalList = Arrays.asList(
myListToParse.stream()
.filter(Objects::nonNull)
.map(this::doSomething)
.toArray(String[]::new)
);
With a micro-micro-benchmark (1M entries, 20% nulls, and a simple transform in doSomething()):
// Runs methodToTest `samples` times and prints/returns timing statistics (ms per run).
private LongSummaryStatistics benchmark(final String testName, final Runnable methodToTest, int samples) {
    long[] timing = new long[samples];
    for (int i = 0; i < samples; i++) {
        long start = System.currentTimeMillis();
        methodToTest.run();
        timing[i] = System.currentTimeMillis() - start;
    }
    final LongSummaryStatistics stats = Arrays.stream(timing).summaryStatistics();
    System.out.println(testName + ": " + stats);
    return stats;
}
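Invoked, for instance, like this (a hypothetical call wired to the snippets above):
benchmark("collect", () -> myListToParse.stream()
        .filter(Objects::nonNull)
        .map(this::doSomething)
        .collect(Collectors.toList()), 10);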
The results are:
parallel:
toArray: LongSummaryStatistics{count=10, sum=3721, min=321, average=372,100000, max=535}
forEach: LongSummaryStatistics{count=10, sum=3502, min=249, average=350,200000, max=389}
collect: LongSummaryStatistics{count=10, sum=3325, min=265, average=332,500000, max=368}
sequential:
toArray: LongSummaryStatistics{count=10, sum=5493, min=517, average=549,300000, max=569}
forEach: LongSummaryStatistics{count=10, sum=5316, min=427, average=531,600000, max=571}
collect: LongSummaryStatistics{count=10, sum=5380, min=444, average=538,000000, max=557}
Parallel without nulls and without the filter (so the stream is SIZED):
toArray has the best performance in that case, and .forEach() fails with "indexOutOfBounds" on the recipient ArrayList; I had to replace it with .forEachOrdered():
toArray: LongSummaryStatistics{count=100, sum=75566, min=707, average=755,660000, max=1107}
forEach: LongSummaryStatistics{count=100, sum=115802, min=992, average=1158,020000, max=1254}
collect: LongSummaryStatistics{count=100, sum=88415, min=732, average=884,150000, max=1014}
Maybe Method 3. I always prefer to keep the logic separate:
Predicate<Long> greaterThan100 = currentParameter -> currentParameter > 100;
List<Long> sourceLongList = Arrays.asList(1L, 10L, 50L, 80L, 100L, 120L, 133L, 333L);
List<Long> resultList = sourceLongList.parallelStream().filter(greaterThan100).collect(Collectors.toList());
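The mapping step can be kept separate in the same way (a sketch; toLabel is a hypothetical transform):
Function<Long, String> toLabel = value -> "value-" + value;
List<String> labels = sourceLongList.stream()
        .filter(greaterThan100)
        .map(toLabel)
        .collect(Collectors.toList());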
If using 3rd party libraries is OK, cyclops-react defines lazy extended collections with this functionality built in. For example, we could simply write:
ListX myListToParse;
ListX myFinalList = myListToParse.filter(elt -> elt != null)
.map(elt -> doSomething(elt));
myFinalList is not evaluated until first access (and thereafter the materialized list is cached and reused).
[Disclosure: I am the lead developer of cyclops-react.]