Is it important to use Characteristics.UNORDERED i

Published 2019-03-20 03:25

Question:

Since I use streams a great deal, some of them dealing with a large amount of data, I thought it would be a good idea to pre-allocate my collection-based collectors with an approximate size to prevent expensive reallocation as the collection grows. So I came up with this, and similar ones for other collection types:

public static <T> Collector<T, ?, Set<T>> toSetSized(int initialCapacity) {
    return Collectors.toCollection(()-> new HashSet<>(initialCapacity));
}

Used like this:

Set<Foo> fooSet = myFooStream.collect(toSetSized(100000));

My concern is that the implementation of Collectors.toSet() sets a Characteristics enum that Collectors.toCollection() does not: Characteristics.UNORDERED. There is no convenient variation of Collectors.toCollection() to set the desired characteristics beyond the default, and I can't copy the implementation of Collectors.toSet() because of visibility issues. So, to set the UNORDERED characteristic I'm forced to do something like this:

static<T> Collector<T,?,Set<T>> toSetSized(int initialCapacity){
    return Collector.of(
            () -> new HashSet<>(initialCapacity),
            Set::add,
            (c1, c2) -> {
                c1.addAll(c2);
                return c1;
            },
            new Collector.Characteristics[]{IDENTITY_FINISH, UNORDERED});
}
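The difference in characteristics described above can be checked directly. This is only a small sketch: the class name CharacteristicsCheck and its two helper methods are made up for illustration, and what toCollection() reports is implementation behaviour (as observed on OpenJDK), not something its documentation promises.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original question.
public class CharacteristicsCheck {

    // Characteristics reported by the built-in toSet() collector.
    static Set<Collector.Characteristics> ofToSet() {
        return Collectors.toSet().characteristics();
    }

    // Characteristics reported by toCollection() with a pre-sized HashSet.
    static Set<Collector.Characteristics> ofToCollection(int initialCapacity) {
        Collector<Object, ?, HashSet<Object>> c =
                Collectors.toCollection(() -> new HashSet<>(initialCapacity));
        return c.characteristics();
    }

    public static void main(String[] args) {
        System.out.println("toSet():        " + ofToSet());
        System.out.println("toCollection(): " + ofToCollection(16));
    }
}
```

On OpenJDK the first set includes UNORDERED and the second does not, which is the gap the hand-written collector above tries to close.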

So here are my questions:

1. Is this my only option for creating an unordered collector for something as simple as a custom toSet()?
2. If I want this to work ideally, is it necessary to apply the unordered characteristic? I've read a question on this forum where I learned that the unordered characteristic is no longer back-propagated into the Stream. Does it still serve a purpose?

Answer 1:

First of all, the UNORDERED characteristic of a Collector is there to aid performance and nothing else. There is nothing wrong with a Collector that does not depend on the encounter order but does not declare that characteristic either.

Whether this characteristic has an impact depends on the stream operations themselves and on implementation details. While the current implementation may not draw much advantage from it, due to the difficulties with back-propagation, that doesn’t imply that future versions won’t. Of course, a stream that is already unordered is not affected by the UNORDERED characteristic of the Collector, and not all stream operations have the potential to benefit from it.
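To illustrate the remark about streams that are already unordered: the ordering constraint can also be dropped on the stream itself with unordered(), independently of what the Collector declares. The following is a minimal sketch; the class and method names are invented for the example, and whether this yields any measurable benefit depends on the pipeline and the implementation.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical example class, not part of the original answer.
public class UnorderedStreamExample {

    // Collects 0..n-1 into a pre-sized HashSet from a stream that has been
    // explicitly marked as unordered.
    static Set<Integer> collectUnordered(int n) {
        return IntStream.range(0, n)
                .boxed()
                .unordered()   // drop the encounter-order constraint on the stream
                .parallel()    // ordering constraints matter mostly for parallel execution
                .collect(Collectors.toCollection(() -> new HashSet<>(n * 2)));
    }

    public static void main(String[] args) {
        System.out.println(collectUnordered(1000).size()); // prints 1000
    }
}
```

The result is the same set either way; only the freedom given to the implementation changes.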

So the more important question is how important it is not to prevent such potential optimizations (perhaps in the future).

Note that there are other unspecified implementation details affecting the potential optimizations when it comes to your second variant. The toCollection(Supplier) collector has unspecified inner workings and only guarantees to provide a final result of the type produced by the Supplier. In contrast, Collector.of(() -> new HashSet<>(initialCapacity), Set::add, (c1, c2) -> { c1.addAll(c2); return c1; }, IDENTITY_FINISH, UNORDERED) defines precisely how the collector ought to work, and may therefore also hinder internal optimizations that future versions might apply to collection-producing collectors.

So a way to specify the characteristics without touching the other aspects of a Collector would be the best solution, but as far as I know, there is no simple way offered by the existing API. But it’s easy to build such a facility yourself:

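// requires java.util.Collections, java.util.EnumSet, java.util.Set, java.util.stream.Collector
public static <T,A,R> Collector<T,A,R> characteristics(
                      Collector<T,A,R> c, Collector.Characteristics... ch) {
    Set<Collector.Characteristics> o = c.characteristics();
    if(!o.isEmpty()) {
        // merge the collector's existing characteristics with the requested ones
        o=EnumSet.copyOf(o);
        Collections.addAll(o, ch);
        // use a fresh array: toArray(ch) would write a null terminator into ch
        // if ch contained duplicates and was longer than the merged set
        ch=o.toArray(new Collector.Characteristics[0]);
    }
    // rebuild the collector from the original functions with the merged characteristics
    return Collector.of(c.supplier(), c.accumulator(), c.combiner(), c.finisher(), ch);
}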

With that method, it’s easy to write, e.g.:

HashSet<String> set=stream
    .collect(characteristics(toCollection(()->new HashSet<>(capacity)), UNORDERED));

or to provide your factory method:

public static <T> Collector<T, ?, Set<T>> toSetSized(int initialCapacity) {
    return characteristics(toCollection(()-> new HashSet<>(initialCapacity)), UNORDERED);
}

This limits the effort necessary to provide your characteristics (if it is a recurring problem), so it won’t hurt to provide them, even if you don’t know how much impact it will have.
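For convenience, here is the approach put together as one compilable unit. Treat it as a sketch rather than a definitive implementation: the class name SizedCollectors is made up, and the merging step is a slight variation that always copies the characteristics into a fresh EnumSet.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical container class for the helper and the factory method.
public class SizedCollectors {

    // Returns a collector that reuses the functions of c but additionally
    // reports the characteristics passed in ch.
    public static <T, A, R> Collector<T, A, R> characteristics(
            Collector<T, A, R> c, Collector.Characteristics... ch) {
        Set<Collector.Characteristics> merged =
                EnumSet.noneOf(Collector.Characteristics.class);
        merged.addAll(c.characteristics());
        Collections.addAll(merged, ch);
        return Collector.of(c.supplier(), c.accumulator(), c.combiner(), c.finisher(),
                merged.toArray(new Collector.Characteristics[0]));
    }

    // Pre-sized variant of toSet() that also declares UNORDERED.
    public static <T> Collector<T, ?, Set<T>> toSetSized(int initialCapacity) {
        return characteristics(
                Collectors.toCollection(() -> new HashSet<>(initialCapacity)),
                Collector.Characteristics.UNORDERED);
    }

    public static void main(String[] args) {
        Collector<String, ?, Set<String>> collector = toSetSized(16);
        Set<String> result = Stream.of("a", "b", "a").collect(collector);
        System.out.println(result.size());               // prints 2
        System.out.println(collector.characteristics()); // includes UNORDERED and IDENTITY_FINISH
    }
}
```

The wrapped collector keeps the IDENTITY_FINISH characteristic it inherited from toCollection() and gains UNORDERED, without re-specifying how elements are accumulated or combined.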