Java - Intersection of multiple collections using

2019-04-10 07:44发布

问题:

I have the following function for the unification of multiple collections (includes repeated elements):

public static <T> List<T> unify(Collection<T>... collections) {
        return Arrays.stream(collections)
               .flatMap(Collection::stream)
               .collect(Collectors.toList()); 
}

It would be nice to have a function with a similar signature for the intersection of collections (using type equality). For example:

public static <T> List<T> intersect(Collection<T>... collections) {
     //Here is where the magic happens
}

I found an implementation of the intersect function, but it doesnt use streams:

public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
    Set<T> common = new LinkedHashSet<T>();
    if (!collections.isEmpty()) {
       Iterator<? extends Collection<T>> iterator = collections.iterator();
       common.addAll(iterator.next());
       while (iterator.hasNext()) {
          common.retainAll(iterator.next());
       }
    }
    return common;
}

Is there any way to implement something similar to the unify function making use of streams? Im not so experienced in java8/stream api, because of that some advice would be really helpful.

回答1:

You can write your own collector in some utility class and use it:

public static <T, S extends Collection<T>> Collector<S, ?, Set<T>> intersecting() {
    class Acc {
        Set<T> result;

        void accept(S s) {
            if(result == null) result = new HashSet<>(s);
            else result.retainAll(s);
        }

        Acc combine(Acc other) {
            if(result == null) return other;
            if(other.result != null) result.retainAll(other.result);
            return this;
        }
    }
    return Collector.of(Acc::new, Acc::accept, Acc::combine, 
                        acc -> acc.result == null ? Collections.emptySet() : acc.result, 
                        Collector.Characteristics.UNORDERED);
}

The usage would be pretty simple:

Set<T> result = Arrays.stream(collections).collect(MyCollectors.intersecting());

Note however that collector cannot short-circuit: even if intermediate result will be an empty collection, it will still process the rest of the stream.

Such collector is readily available in my free StreamEx library (see MoreCollectors.intersecting()). It works with normal streams like above, but if you use it with StreamEx (which extends normal stream) it becomes short-circuiting: the processing may actually stop early.



回答2:

While it’s tempting to think of retainAll as a black-box bulk operation that must be the most efficient way to implement an intersection operation, it just implies iterating over the entire collection and testing for each element whether it is contained in the collection passed as argument. The fact that you are calling it on a Set does not imply any advantage, as it is the other collection, whose contains method will determine the overall performance.

This implies that linearly scanning a set and testing each element for containment within all other collections will be on par with performing retainAll for each collection. Bonus points for iterating over the smallest collection in the first place:

public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
    if(collections.isEmpty()) return Collections.emptySet();
    Collection<T> smallest
        = Collections.min(collections, Comparator.comparingInt(Collection::size));
    return smallest.stream().distinct()
        .filter(t -> collections.stream().allMatch(c -> c==smallest || c.contains(t)))
        .collect(Collectors.toSet());
}

or, alternatively

public static <T> Set<T> intersect(Collection<? extends Collection<T>> collections) {
    if(collections.isEmpty()) return Collections.emptySet();
    Collection<T> smallest
        = Collections.min(collections, Comparator.comparingInt(Collection::size));
    HashSet<T> result=new HashSet<>(smallest);
    result.removeIf(t -> collections.stream().anyMatch(c -> c!=smallest&& !c.contains(t)));
    return result;
}


回答3:

I think maybe it would make more sense to use Set instead of List (maybe that was a typo in your question):

public static <T> Set<T> intersect(Collection<T>... collections) {
     //Here is where the magic happens
     return (Set<T>) Arrays.stream(collections).reduce(
             (a,b) -> {
                 Set<T> c = new HashSet<>(a);
                 c.retainAll(b);
                 return c;
             }).orElseGet(HashSet::new);
}


回答4:

and here's a Set implementation. retainAll() is a Collection method, so it works on all of them.

public static <T> Set<T> intersect(Collection<T>... collections)
{
    return new HashSet<T>(Arrays.stream(collections).reduce(
            ((a, b) -> {
                a.retainAll(b);
                return a;
            })
    ).orElse(new HashSet<T>());
}

And with List<> if order is important.

public static <T> List<T> intersect2(Collection<T>... collections)
{
    return new ArrayList<T>(Arrays.stream(collections).reduce(
            ((a, b) -> {
                a.retainAll(b);
                return a;
            })
    ).orElse(new ArrayList<T>()));
}

Java Collections lets them look almost identical. If required, you could filter the List to be distinct as it may contain duplicates.

public static <T> List<T> intersect2(Collection<T>... collections)
{
    return new ArrayList<T>(Arrays.stream(collections).reduce(
            ((a, b) -> {
                a.retainAll(b);
                return a;
            })
    ).orElse(new ArrayList<T>())).stream().distinct());
}


回答5:

You can write it with streams as follows:

return collections.stream()
        .findFirst()        // find the first collection
        .map(HashSet::new)  // make a set out of it
        .map(first -> collections.stream()
                .skip(1)    // don't need to process the first one
                .collect(() -> first, Set::retainAll, Set::retainAll) 
        )
        .orElseGet(HashSet::new); // if the input collection was empty, return empty set

The 3-argument collect replicates your retainAll logic

The streams implementation gives you the flexibility to tweak the logic more easily. For example, if all your collections are sets, you might want to start with the smallest set instead of the first one (for performance). To do that, you would replace findFirst() with min(comparing(Collection::size)) and get rid of the skip(1). Or you could see if you get better performance with the type of data you work with by running the second stream in parallel and all you would need to do is change stream to parallelStream.